1 About

This project is the final group project of Applied Statistics with R taught by Professor Kostis Christodoulou at London Business School.

My collaborators are Alessandro Angeletti, Nitya Chopra, Johanna Jeffery, and Christopher Lewis. (Hail Group 13!)

The full write-up is long. For a quick look at our final findings, you can download our presentation deck here.

2 Executive Summary

The purpose of this report is to produce a regression model that predicts the cost of a 4-night stay for 2 people in an Airbnb in Beijing. To do this, we progressed through four stages. We started with background research on the Beijing Airbnb market, followed by Exploratory Data Analysis (EDA). During EDA, we viewed, cleaned, wrangled and visualised our data (specifically the variability in the different independent variables, and geographical mapping). In the third stage, we tested different combinations of regressors and their functional forms to arrive at a final model with the highest possible explanatory power. The process was iterative; we evaluated variables on the basis of their t-statistics, the marginal improvement in adjusted R-squared, and the residual standard error.

Having settled on a final model with an explanatory power of 54.4%, we generated imaginary Airbnb listings with some common base characteristics, such as property type “Apartment” and room type “Private”, and predicted the price of a 4-night stay. In addition, we varied certain characteristics, such as location, amenities and superhost status, to demonstrate how considerably the price changes as the values of these regressors change.
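The selection criteria mentioned above can all be read straight off a fitted model. The sketch below is only an illustration: it uses R's built-in mtcars data as a stand-in for the listings, the two formulas are placeholders rather than the models from this report, and compare_models() is a hypothetical helper name.

```r
# Illustration only: mtcars stands in for the listings data,
# and the formulas are placeholders, not the report's models.
model1 <- lm(mpg ~ wt, data = mtcars)
model2 <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical helper: one row per model with the two fit statistics
compare_models <- function(...) {
  fits <- list(...)
  data.frame(
    formula       = vapply(fits, function(f) deparse(formula(f)), character(1)),
    adj_r_squared = vapply(fits, function(f) summary(f)$adj.r.squared, numeric(1)),
    resid_std_err = vapply(fits, function(f) summary(f)$sigma, numeric(1))
  )
}

comparison <- compare_models(model1, model2)
comparison

# t-statistics of the individual regressors in the larger model
summary(model2)$coefficients[, "t value"]
```

A regressor earns its place if adjusted R-squared goes up and the residual standard error goes down when it is added, and if its own t-statistic is large in absolute value.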

3 Background: Airbnb in Beijing

The concept of homestays first appeared in China in 2011, when the idea of the sharing economy began to spread there. After 8-9 years of market education, the sharing economy is now deeply rooted in the hearts of Chinese people. As a result, people have become more willing to make use of their spare homes and join the ranks of homestay hosts. Meanwhile, various tourism policies in China have encouraged the development of distinctive homestays since 2015 and will continue to favor the homestay industry in the next 3-5 years. In the future, the government will keep encouraging the effective use of idle personal properties and supporting the development of homestays.

The sharing economy has reshaped not only how people prefer to use their homes but also which travel accommodation they prefer. Homestays have turned from a complement to hotels into a substitute for them, thanks to their good value for money, varied modern interior design, and suitability for family trips. The current distribution of homestays is consistent with the overall development of China’s tourism industry: homestays are concentrated in areas where tourism is relatively developed, such as the East and South of China. Beijing tops the listing ranking with over 3500 listings (2018, Homestay Investment and Investment, China Commercial Industry Research Institute).

Currently, Airbnb is one of the major players in the Chinese B2B homestay industry. Its international identity is an advantage: outbound travel booked through Airbnb has grown by 110%, and the number of people staying in domestic listings in China has grown by 250%. However, it also faces challenges from local competitors in the battle to localize and to meet the demand of the “sinking market” (lower-tier cities and towns), the new focal point of all industries.

4 Exploratory Data Analysis

First we have to load the data.

library(tidyverse) # dplyr, readr, stringr, ggplot2 and the %>% pipe
library(janitor)   # clean_names()

data <- vroom::vroom("listings.csv.gz") %>% 
  clean_names()

4.1 Raw Data Exploration

# Let's have a look at what we're dealing with!
glimpse(data)
## Rows: 36,283
## Columns: 106
## $ id                                           <dbl> 44054, 100213, 114384, 1…
## $ listing_url                                  <chr> "https://www.airbnb.com/…
## $ scrape_id                                    <dbl> 2.02e+13, 2.02e+13, 2.02…
## $ last_scraped                                 <date> 2020-06-20, 2020-06-20,…
## $ name                                         <chr> "Modern and Comfortable …
## $ summary                                      <chr> "East Apartments offers …
## $ space                                        <chr> "East Apartments is a we…
## $ description                                  <chr> "East Apartments offers …
## $ experiences_offered                          <chr> "none", "none", "none", …
## $ neighborhood_overview                        <chr> "The neighborhood is a p…
## $ notes                                        <chr> "*For long term reservat…
## $ transit                                      <chr> "The easiest method to g…
## $ access                                       <chr> "*Guests have access to …
## $ interaction                                  <chr> NA, NA, "Helen和Wendy会全程为…
## $ house_rules                                  <chr> "Registration All guests…
## $ thumbnail_url                                <lgl> NA, NA, NA, NA, NA, NA, …
## $ medium_url                                   <lgl> NA, NA, NA, NA, NA, NA, …
## $ picture_url                                  <chr> "https://a0.muscache.com…
## $ xl_picture_url                               <lgl> NA, NA, NA, NA, NA, NA, …
## $ host_id                                      <dbl> 192875, 527062, 533062, …
## $ host_url                                     <chr> "https://www.airbnb.com/…
## $ host_name                                    <chr> "East Apartments", "Joe"…
## $ host_since                                   <date> 2010-08-06, 2011-04-22,…
## $ host_location                                <chr> "Beijing, Beijing, China…
## $ host_about                                   <chr> "Hi everyone!  My name i…
## $ host_response_time                           <chr> "within an hour", "N/A",…
## $ host_response_rate                           <chr> "100%", "N/A", "100%", "…
## $ host_acceptance_rate                         <chr> "95%", "N/A", "100%", "1…
## $ host_is_superhost                            <lgl> FALSE, FALSE, FALSE, FAL…
## $ host_thumbnail_url                           <chr> "https://a0.muscache.com…
## $ host_picture_url                             <chr> "https://a0.muscache.com…
## $ host_neighbourhood                           <chr> "Shuangjing", NA, "ITC",…
## $ host_listings_count                          <dbl> 5, 4, 5, 5, 1, 7, 7, 6, …
## $ host_total_listings_count                    <dbl> 5, 4, 5, 5, 1, 7, 7, 6, …
## $ host_verifications                           <chr> "['email', 'phone', 'fac…
## $ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ host_identity_verified                       <lgl> FALSE, FALSE, FALSE, FAL…
## $ street                                       <chr> "Beijing, Beijing, China…
## $ neighbourhood                                <chr> "Chaoyang", NA, "ITC", "…
## $ neighbourhood_cleansed                       <chr> "朝阳区 / Chaoyang", "密云县 /…
## $ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA, …
## $ city                                         <chr> "Beijing", "Beijing", "B…
## $ state                                        <chr> "Beijing", "Beijing", "B…
## $ zipcode                                      <dbl> 100022, 101508, NA, 1000…
## $ market                                       <chr> "Beijing", "Other (Inter…
## $ smart_location                               <chr> "Beijing, China", "Beiji…
## $ country_code                                 <chr> "CN", "CN", "CN", "CN", …
## $ country                                      <chr> "China", "China", "China…
## $ latitude                                     <dbl> 39.9, 40.7, 39.9, 39.9, …
## $ longitude                                    <dbl> 116, 117, 116, 116, 116,…
## $ is_location_exact                            <lgl> TRUE, TRUE, TRUE, FALSE,…
## $ property_type                                <chr> "Serviced apartment", "G…
## $ room_type                                    <chr> "Entire home/apt", "Priv…
## $ accommodates                                 <dbl> 9, 2, 2, 2, 3, 2, 4, 2, …
## $ bathrooms                                    <dbl> 2, 1, 1, 1, 1, 1, 1, 1, …
## $ bedrooms                                     <dbl> 3, 1, 1, 1, 1, 1, 1, 1, …
## $ beds                                         <dbl> 4, 1, 1, 1, 2, 1, 2, 1, …
## $ bed_type                                     <chr> "Real Bed", "Real Bed", …
## $ amenities                                    <chr> "{TV,\"Cable TV\",Intern…
## $ square_feet                                  <dbl> 1464, NA, NA, NA, 323, N…
## $ price                                        <chr> "$835.00", "$1,203.00", …
## $ weekly_price                                 <chr> "$8,373.00", "$7,200.00"…
## $ monthly_price                                <chr> "$27,603.00", "$28,800.0…
## $ security_deposit                             <chr> "$708.00", "$0.00", NA, …
## $ cleaning_fee                                 <chr> "$71.00", "$0.00", NA, "…
## $ guests_included                              <dbl> 6, 1, 1, 1, 2, 1, 1, 2, …
## $ extra_people                                 <chr> "$71.00", "$0.00", "$0.0…
## $ minimum_nights                               <dbl> 2, 1, 1, 1, 3, 1, 1, 1, …
## $ maximum_nights                               <dbl> 365, 30, 730, 1125, 365,…
## $ minimum_minimum_nights                       <dbl> 2, 1, 1, 1, 3, 1, 1, 1, …
## $ maximum_minimum_nights                       <dbl> 2, 1, 1, 1, 3, 1, 1, 1, …
## $ minimum_maximum_nights                       <dbl> 365, 30, 730, 1125, 365,…
## $ maximum_maximum_nights                       <dbl> 365, 30, 730, 1125, 365,…
## $ minimum_nights_avg_ntm                       <dbl> 2, 1, 1, 1, 3, 1, 1, 1, …
## $ maximum_nights_avg_ntm                       <dbl> 365, 30, 730, 1125, 365,…
## $ calendar_updated                             <chr> "5 months ago", "27 mont…
## $ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ availability_30                              <dbl> 19, 0, 19, 19, 19, 2, 0,…
## $ availability_60                              <dbl> 49, 0, 49, 49, 49, 2, 0,…
## $ availability_90                              <dbl> 79, 0, 79, 79, 79, 2, 0,…
## $ availability_365                             <dbl> 354, 0, 354, 354, 169, 2…
## $ calendar_last_scraped                        <date> 2020-06-20, 2020-06-20,…
## $ number_of_reviews                            <dbl> 99, 2, 66, 10, 290, 26, …
## $ number_of_reviews_ltm                        <dbl> 7, 0, 1, 1, 22, 0, 2, 0,…
## $ first_review                                 <date> 2010-08-25, 2017-08-27,…
## $ last_review                                  <date> 2020-01-06, 2017-10-08,…
## $ review_scores_rating                         <dbl> 91, 100, 93, 98, 97, 77,…
## $ review_scores_accuracy                       <dbl> 9, 10, 10, 9, 10, 8, 8, …
## $ review_scores_cleanliness                    <dbl> 8, 9, 9, 9, 10, 7, 7, 8,…
## $ review_scores_checkin                        <dbl> 10, 10, 10, 9, 10, 9, 9,…
## $ review_scores_communication                  <dbl> 10, 10, 10, 10, 10, 9, 9…
## $ review_scores_location                       <dbl> 10, 9, 10, 10, 10, 9, 9,…
## $ review_scores_value                          <dbl> 9, 9, 10, 9, 10, 8, 9, 8…
## $ requires_license                             <lgl> FALSE, FALSE, FALSE, FAL…
## $ license                                      <chr> NA, NA, "Exempt", "Exemp…
## $ jurisdiction_names                           <lgl> NA, NA, NA, NA, NA, NA, …
## $ instant_bookable                             <lgl> FALSE, TRUE, TRUE, TRUE,…
## $ is_business_travel_ready                     <lgl> FALSE, FALSE, FALSE, FAL…
## $ cancellation_policy                          <chr> "strict_14_with_grace_pe…
## $ require_guest_profile_picture                <lgl> FALSE, FALSE, FALSE, FAL…
## $ require_guest_phone_verification             <lgl> FALSE, FALSE, FALSE, FAL…
## $ calculated_host_listings_count               <dbl> 5, 4, 5, 5, 1, 5, 5, 6, …
## $ calculated_host_listings_count_entire_homes  <dbl> 5, 0, 5, 5, 1, 5, 5, 5, …
## $ calculated_host_listings_count_private_rooms <dbl> 0, 3, 0, 0, 0, 0, 0, 1, …
## $ calculated_host_listings_count_shared_rooms  <dbl> 0, 1, 0, 0, 0, 0, 0, 0, …
## $ reviews_per_month                            <dbl> 0.83, 0.06, 0.73, 0.11, …

From this output we can see that:

  • there are just over 36 thousand observations (Airbnb listings) in Beijing in the data set;
  • the data includes 106 different variables;
  • these variables are a mixture of ‘double’, ‘character’, ‘logical’ and ‘date’ types;
  • some of our ‘price’ variables include dollar signs ($) and are stored as ‘character’ rather than ‘double’ variables; and
  • there are many, MANY, NAs.

Since this is a large data set with a lot going on, we will first select the variables we’re interested in. We will then reformat them to ensure that there are no special characters such as ‘$’ or ‘%’.

4.2 Summary Statistics and Missing Values

  listings <- data %>% 
  
  #Lets pick the variables we need
  select(c(price,
           cleaning_fee,
           extra_people,
           room_type,
           property_type,
           number_of_reviews,
           review_scores_rating,
           longitude,
           latitude,
           neighbourhood,
           minimum_nights,
           guests_included,
           bathrooms,
           bedrooms,
           beds,
           accommodates,
           host_is_superhost,
           neighbourhood_cleansed,
           cancellation_policy,
           listing_url,
           is_location_exact,
           security_deposit,
           review_scores_cleanliness,
           instant_bookable,
           amenities,
           calculated_host_listings_count,
           reviews_per_month,
           host_acceptance_rate
           )
         ) %>% 

  # Removing dollar signs and changing into numerical variables
  
  mutate(
 
    # Changing Price from chr to dbl
    
    price = parse_number(price),
    
    # Changing Cleaning Fee from chr to dbl
    
    cleaning_fee = parse_number(cleaning_fee),
    
    # Changing Extra People fee from chr to dbl
    
    extra_people = parse_number(extra_people),
    
    # Changing Security Deposit from chr to dbl
    
    security_deposit = parse_number(security_deposit),
    
    # Changing host acceptance rate 
    
    host_acceptance_rate = parse_number(host_acceptance_rate)/100
  )
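To see what parse_number() does to these columns, here is a minimal sketch using a few values copied from the glimpse() output above. The na argument is our addition for this example and is not part of the code above; without it, "N/A" still becomes NA, but readr reports a parsing problem.

```r
library(readr)

# Values copied from the glimpse() output above
raw_prices <- c("$835.00", "$1,203.00", "$27,603.00")
raw_rates  <- c("95%", "100%", "N/A")

# Strips the "$" and the "," grouping mark and returns doubles
prices <- parse_number(raw_prices)

# Declaring "N/A" as a missing-value marker turns it into NA silently
rates <- parse_number(raw_rates, na = c("", "NA", "N/A")) / 100

prices
rates
```

Dividing by 100 puts the acceptance rate on a 0 to 1 scale, matching the host_acceptance_rate column created above.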

Now that we have all the variables in the required format, we want to check the quality of our data by investigating some of the variables’ key characteristics.

4.2.1 Removing Missing Values

# Check which variables have lots of missing values (NA's)
listings %>% 
  skim() %>% 
  kbl() %>% 
  kable_styling()
Variable type: character

skim_variable           n_missing  complete_rate  min   max  empty  n_unique  whitespace
room_type                       0          1.000   11    15      0         3           0
property_type                   0          1.000    3    22      0        45           0
neighbourhood               13370          0.632    3    36      0        61           0
neighbourhood_cleansed          0          1.000    3    16      0        16           0
cancellation_policy             0          1.000    8    27      0         3           0
listing_url                     0          1.000   34    37      0     36283           0
amenities                       0          1.000    2  1917      0     28222           0

Variable type: logical

skim_variable      n_missing  complete_rate   mean  count
host_is_superhost          1          1.000  0.264  FAL: 26711, TRU: 9571
is_location_exact          0          1.000  0.565  TRU: 20497, FAL: 15786
instant_bookable           0          1.000  0.643  TRU: 23333, FAL: 12950

Variable type: numeric

skim_variable                   n_missing  complete_rate     mean        sd      p0     p25     p50     p75     p100  hist
price                                   0          1.000  726.046  1861.040    0.00  255.00  396.00  651.00  70723.0  ▇▁▁▁▁
cleaning_fee                        23123          0.363   60.943   218.669    0.00    0.00   40.00   70.00  10000.0  ▇▁▁▁▁
extra_people                            0          1.000   20.474    79.101    0.00    0.00    0.00    0.00   2118.0  ▇▁▁▁▁
number_of_reviews                       0          1.000    6.752    16.834    0.00    0.00    1.00    5.00    344.0  ▇▁▁▁▁
review_scores_rating                16270          0.552   94.789    10.836   20.00   94.00  100.00  100.00    100.0  ▁▁▁▁▇
longitude                               0          1.000  116.442     0.258  115.47  116.34  116.43  116.50    117.5  ▁▁▇▁▁
latitude                                0          1.000   40.022     0.235   39.46   39.90   39.94   40.05     41.0  ▁▇▁▂▁
minimum_nights                          0          1.000    4.308    28.307    1.00    1.00    1.00    1.00   1086.0  ▇▁▁▁▁
guests_included                         0          1.000    1.365     1.257    1.00    1.00    1.00    1.00     16.0  ▇▁▁▁▁
bathrooms                              21          0.999    1.424     1.375    0.00    1.00    1.00    1.50    101.5  ▇▁▁▁▁
bedrooms                              142          0.996    1.663     1.480    0.00    1.00    1.00    2.00     50.0  ▇▁▁▁▁
beds                                  380          0.990    2.242     2.754    0.00    1.00    1.00    2.00    115.0  ▇▁▁▁▁
accommodates                            0          1.000    3.742     3.090    1.00    2.00    2.00    4.00     18.0  ▇▁▁▁▁
security_deposit                    23793          0.344  655.045  2337.306    0.00    0.00  200.00  700.00  35362.0  ▇▁▁▁▁
review_scores_cleanliness           16272          0.552    9.518     1.065    2.00    9.00   10.00   10.00     10.0  ▁▁▁▁▇
calculated_host_listings_count          0          1.000    9.543    13.636    1.00    2.00    5.00   11.00     89.0  ▇▁▁▁▁
reviews_per_month                   15644          0.569    0.649     0.850    0.01    0.14    0.31    0.81     22.9  ▇▁▁▁▁
host_acceptance_rate                 6280          0.827    0.922     0.189    0.00    0.95    1.00    1.00      1.0  ▁▁▁▁▇
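If you only care about the missing values, the n_missing and complete_rate columns can be reproduced with a couple of lines of base R. This is a minimal sketch on a made-up data frame (toy and na_summary are names invented for the example); on the real data you would pass listings instead.

```r
# Toy data frame standing in for `listings` (values are made up)
toy <- data.frame(
  price        = c(835, 1203, 602, 411),
  cleaning_fee = c(71, 0, NA, NA),
  beds         = c(4, 1, NA, 2)
)

na_summary <- data.frame(
  variable      = names(toy),
  n_missing     = colSums(is.na(toy)),    # count of NAs per column
  complete_rate = colMeans(!is.na(toy)),  # share of non-missing values
  row.names     = NULL
)

# Most incomplete variables first
na_summary[order(-na_summary$n_missing), ]
```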

Surprises in the data

  1. Here we can see that cleaning_fee has an extremely high number of missing (NA) values. This is most likely because some properties include the cleaning fee within the price, and thus look cheaper when you’re looking to book as there aren’t any “add-on” costs. Interestingly, however, some properties do charge it separately, and the cleaning costs vary widely, ranging from $0 up to $10,000! Therefore, we will have to look at how the cleaning fee correlates with other characteristics of the listings (such as the flat size, number of bedrooms, number of guests, etc.).
  2. Judging by the level of activity, Airbnb in Beijing still has room to grow; it isn’t the giant we might assume. Looking at the total number of listings, we find that Beijing (and China) is nowhere close to the top 10 trending cities (or countries) on Airbnb. This becomes all the more obvious when we notice that each property has on average only about 6.8 reviews!
  3. Looking at the size of properties, we note that the median property (we use the median rather than the mean because of several outliers) has only 1 bedroom and 1 bathroom. This indicates that most rentals will typically be limited to one couple (and likely without kids), as confirmed by the median number of guests a listing accommodates (just 2 people).

In this next section of code, we tackle the ‘NA’ values in cleaning_fee, security_deposit, reviews_per_month and host_acceptance_rate, and also transform the amenities variable into a format we can use for our model.

data_cleaned <- listings %>% 
  
# In order to handle the high volume of NA's in cleaning_fee, we will change these values to a 0
  
  mutate(
    cleaning_fee = case_when(
      is.na(cleaning_fee) ~ 0,
      TRUE ~ cleaning_fee
        ),
    
# We apply the same logic to the security_deposit variable
  
    security_deposit = case_when(
      is.na(security_deposit) ~ 0,
      TRUE ~ security_deposit
        ),
  
# and again to the reviews_per_month variable
  
    reviews_per_month = case_when(
      is.na(reviews_per_month) ~ 0,
      TRUE ~ reviews_per_month
        ),

# Creating a new variable 'wifi' which returns as TRUE when wifi is detected in the variable 'amenities'

    wifi = case_when(
      str_detect(amenities, "Wifi") ~ TRUE,
    
# Allowing for differences in spelling and upper/lowercases 
    
      str_detect(amenities, "wifi") ~ TRUE,
      TRUE ~ FALSE
      ),

# Process repeated again to create a 'breakfast' variable
  
    breakfast = case_when(
      str_detect(amenities, "Breakfast") ~ TRUE,
      str_detect(amenities, "breakfast") ~ TRUE,
      TRUE ~ FALSE
      ),

# We count the amenities at each property by splitting the string on "," (commas)
# and counting the resulting pieces. Note that an empty set "{}" still counts as 1.

    services = lengths(strsplit(amenities, ",")),
    host_acceptance_rate = case_when(
      is.na(host_acceptance_rate) ~ 0,
      TRUE ~ host_acceptance_rate
      )
  )

# lets examine wifi and breakfast columns
data_cleaned %>% 
  select(c(price, wifi, breakfast))
## # A tibble: 36,283 x 3
##    price wifi  breakfast
##    <dbl> <lgl> <lgl>    
##  1   835 TRUE  FALSE    
##  2  1203 TRUE  TRUE     
##  3   602 TRUE  FALSE    
##  4   602 TRUE  FALSE    
##  5   411 TRUE  TRUE     
##  6   552 TRUE  FALSE    
##  7   601 TRUE  FALSE    
##  8   403 TRUE  FALSE    
##  9   743 TRUE  FALSE    
## 10   418 TRUE  FALSE    
## # … with 36,273 more rows
# Let's skim the cleaning_fee variable to see if we have succeeded
data_cleaned %>% 
skim(cleaning_fee) %>% 
  # the kable package is used to format the resulting tables in a more visually appealing way
  kbl() %>% 
  kable_styling()
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
numeric cleaning_fee 0 1 22.1 135 0 0 0 0 10000 ▇▁▁▁▁
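The case_when() calls above work, but the same two ideas (replace NA with 0, and detect a word regardless of case) can be written more compactly. Below is a minimal sketch on made-up rows; toy and toy_cleaned are names invented for the example. One difference to be aware of: the case-insensitive pattern also matches spellings such as "WIFI", which the two explicit patterns above would miss.

```r
library(dplyr)
library(stringr)

# Toy rows in the same shape as `listings` (values are made up)
toy <- tibble(
  cleaning_fee = c(71, NA, 0),
  amenities    = c('{TV,Wifi,"Free breakfast"}', "{WIFI}", "{}")
)

toy_cleaned <- toy %>%
  mutate(
    # coalesce() returns the first non-missing value, so NA becomes 0
    cleaning_fee = coalesce(cleaning_fee, 0),
    # ignore_case = TRUE matches "Wifi", "wifi", "WIFI", ...
    wifi      = str_detect(amenities, regex("wifi", ignore_case = TRUE)),
    breakfast = str_detect(amenities, regex("breakfast", ignore_case = TRUE))
  )

toy_cleaned
```

Note that str_detect() returns NA for a missing amenities string, whereas the case_when() version returns FALSE. That does not matter here, since amenities has no missing values.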

Fun facts!

Taking another look at the data, we can note a few more stats:

  1. The average size of apartments is only 56m^2;
  2. The most expensive listing costs $70,723 per night;
  3. One of the listings has 101.5 bathrooms;
  4. Approximately 40% of listings are apartments; and
  5. One of the listings claims to be an igloo.

4.3 Visualising The Data

Summary statistics can only take us so far to understanding the data, so it is important to also visualise our variables.

4.3.1 Numerical Data

# Using patchwork to create a visualization of density for all numerical variables
p1 <- ggplot(data = data_cleaned, aes(x = price)) +
  geom_density() +
  theme_bw() + 
  labs(title = "Variability of Price",
       subtitle="Difficulty interpreting density due to outliers in price") +
  theme(
    plot.title = element_text(face="bold")
  )

# Before creating plots for all other numerical variables, let's check the readability
p1

#Some of the x-axis for the data mean that it is difficult to get a full picture 
#of the variability in the variables

p1a <- ggplot(data = data_cleaned, aes(x = price)) +
  geom_density() +
  
#Here we add a limit to the x-axis, where the maximum value is 10000. 
#We add this to most of the plots, where necessary
  
  xlim(0, 10000) +
  theme_bw() +
  labs(title = "Price $", x = "", y = "") +
  theme(plot.title = element_text(size = 8))

p2a <- ggplot(data = data_cleaned, aes(x = cleaning_fee)) +
  geom_histogram() +
  xlim(0, 300) +
  theme_bw() +
  labs(title = "Cleaning Fee $", x = "", y = "")+
  theme(plot.title = element_text(size = 8))

p5a <- ggplot(data = data_cleaned, aes(x = guests_included)) +
  geom_histogram() +
  xlim(0, 8) +
  theme_bw()+
  labs(title = "Guests Included", x = "", y = "")+
  theme(plot.title = element_text(size = 8))

p3a <- ggplot(data = data_cleaned, aes(x = extra_people)) +
  geom_density() +
  xlim(0, 400) +
  theme_bw()+
  labs(title = "Extra People Fee $", x = "", y = "")+
  theme(plot.title = element_text(size = 8))

p10a <- ggplot(data = data_cleaned, aes(x = number_of_reviews)) +
  geom_histogram() +
  xlim(0, 100) +
  theme_bw()+
  labs(title = "No. of Reviews", x = "", y = "")+
  theme(plot.title = element_text(size = 8))

p11a <- ggplot(data = data_cleaned, aes(x = review_scores_rating)) +
  geom_histogram() +
  xlim(0, 100) +
  theme_bw() +
  labs(title = "Review Ratings", x = "", y = "")+
  theme(plot.title = element_text(size = 8))

p9a <- ggplot(data = data_cleaned, aes(x = minimum_nights)) +
  geom_histogram() +
  xlim(0, 150) +
  theme_bw() +
  labs(title = "Minimum Night Stay", x = "", y = "")+
  theme(plot.title = element_text(size = 8))

p4a <- ggplot(data = data_cleaned, aes(x = accommodates)) +
  geom_histogram() +
  theme_bw()+
  labs(title = "No. Accommodated", x = "", y = "")+
  theme(plot.title = element_text(size = 8))

p7a <- ggplot(data = data_cleaned, aes(x = beds)) +
  geom_histogram() +
  xlim(0, 20) +
  theme_bw()+
  labs(title = "No. of Beds", x = "", y = "")+
  theme(plot.title = element_text(size = 8))

p8a <- ggplot(data = data_cleaned, aes(x = bathrooms)) +
  geom_histogram() +
  xlim(0, 20) +
  theme_bw()+
  labs(title = "No. of Bathrooms", x = "", y = "")+
  theme(plot.title = element_text(size = 8))

p6a <- ggplot(data = data_cleaned, aes(x = bedrooms)) +
  geom_histogram() +
  xlim(0, 15) +
  theme_bw()+
  labs(title = "No. of Bedrooms", x = "", y = "")+
  theme(plot.title = element_text(size = 8))

p1a + p2a + p3a + p4a + p5a + p6a + p7a + p8a + p9a + p10a + p11a +
  plot_annotation(title = "Variability in Numerical Variables", 
                  subtitle = "Majority of numerical variables are highly right-skewed")
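The eleven plots above repeat the same few lines, so a small helper can cut the repetition. The sketch below is one possible way to do it, not the code used for the figure: plot_hist() and its arguments are hypothetical names, and it uses coord_cartesian() to zoom instead of xlim(), which drops the rows outside the limits before the histogram is computed.

```r
library(ggplot2)

# Hypothetical helper (not in the original code): one small histogram per variable
plot_hist <- function(df, var, title, x_max, binwidth = 1) {
  ggplot(df, aes(x = .data[[var]])) +
    geom_histogram(binwidth = binwidth) +
    # coord_cartesian() zooms in without removing any observations
    coord_cartesian(xlim = c(0, x_max)) +
    theme_bw() +
    labs(title = title, x = "", y = "") +
    theme(plot.title = element_text(size = 8))
}

# Toy data standing in for `data_cleaned` (values are made up)
toy <- data.frame(beds = c(1, 1, 2, 2, 3, 4, 115))

p_beds <- plot_hist(toy, "beds", "No. of Beds", x_max = 20)
p_beds
```

The resulting objects are ordinary ggplots, so they can be combined with patchwork in exactly the same way as p1a to p11a.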

# using ggpairs to plot a scatterplot matrix with pairwise correlations
data_cleaned %>% 
  select(c(price, cleaning_fee, guests_included, 
           extra_people, number_of_reviews, review_scores_rating, 
           minimum_nights, accommodates, beds, bathrooms, bedrooms, security_deposit)
         ) %>% 
    ggpairs()

Lots of data, lots of noise…

Having had some time to look through this information, we found some interesting correlations between variables.

  1. Firstly, the number of reviews is slightly negatively correlated with the price of the listings (the correlation is small but statistically significant). This shouldn’t be a huge surprise. After all, the higher the listing price, the higher the expected quality of the property. As quality and price don’t increase linearly, guests’ expectations are less likely to be met as the price rises dramatically, and ratings start to decrease.
  2. Interestingly, the cleaning fee isn’t correlated with any of the obvious variables (such as the number of bathrooms or bedrooms). Instead, we find that the cleaning fee tends to increase with the number of additional people (as opposed to the number of people the flat can accommodate). Perhaps this is because the majority of listings embed their cleaning fee in their price and thus add a surcharge if many more individuals stay at the property.
  3. Finally, if you want to hike up the price of your listing, the best way to do so is to simply rent out a larger flat. This was to be expected.
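A note on "slight but statistically significant": with tens of thousands of listings, even a very small correlation will be significant, so it helps to look at the estimate and the p-value together. The sketch below shows how one could check this with cor.test(). The data are synthetic and only mimic the direction of the relationship; on the real data the inputs would be data_cleaned$price and data_cleaned$number_of_reviews.

```r
# Synthetic data for illustration only
set.seed(13)
n <- 5000
price <- rlnorm(n, meanlog = 6, sdlog = 0.8)
number_of_reviews <- rpois(n, lambda = pmax(0.5, 8 - price / 400))

result <- cor.test(price, number_of_reviews)

result$estimate   # Pearson correlation coefficient
result$p.value    # p-value for the null hypothesis of zero correlation
result$conf.int   # 95% confidence interval for the correlation
```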

4.3.2 Categorical Data

Some of the character variables have lots of different values, e.g. property_type. Here we look at cleaning this to make it more manageable.

data_cleaned %>% 
  # Counting the frequency of property types
  count(property_type) %>% 
  # Arranging them into descending order by frequency
  arrange(desc(n))
## # A tibble: 45 x 2
##    property_type          n
##    <chr>              <int>
##  1 Apartment          14428
##  2 Condominium         4761
##  3 House               4129
##  4 Loft                2960
##  5 Serviced apartment  2189
##  6 Farm stay           1330
##  7 Villa               1222
##  8 Bungalow             985
##  9 Cottage              596
## 10 Townhouse            513
## # … with 35 more rows

Wait a second…

It is interesting to note that some of the listings don’t make a whole lot of sense. How is it that one of the world’s largest metropolises has Farm Stays or Bungalows available? When checking the listings, we find that, more often than not, the owners aren’t being 100% transparent. The most obvious lie was the listing claiming to be an igloo: it calls itself an igloo because the cooling power of its AC is supposedly incredible.

Anyhow,

We now classify different types of properties into 5 groups - the 4 most prominent ones and remaining smaller categories labeled as ‘Other’.

cleaning <- data_cleaned %>%
      # creating a new variable 'prop_type_simplified' that groups property types 
      #into one of 5 categories. For example, "Boutique hotel" will now come under "Other"

  mutate(prop_type_simplified = case_when(
    
        # Here we specify that if property_type is equal to the top 4 types, 
        #then we pass through the property_type value
    
        property_type %in% c("Apartment","Condominium", "House","Loft") ~ property_type, 
        
        # This specifies that if the property_type value doesn't meet this criteria, 
        #the new variable will equal 'Other'
        
        TRUE ~ "Other"
  ))
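As an alternative to hard-coding the four most common types, forcats can pick the top levels from the data. This is a sketch of that approach on a made-up vector, not the code used in this report. It returns a factor rather than a character column, and it keeps 2 levels here only because the toy vector is so small.

```r
library(forcats)

# Toy vector standing in for data_cleaned$property_type
property_type <- c("Apartment", "Apartment", "Apartment",
                   "House", "House", "Igloo", "Villa")

# Keep the 2 most frequent levels (the report keeps 4) and lump the rest
prop_type_simplified <- fct_lump_n(property_type, n = 2, other_level = "Other")

table(prop_type_simplified)
```

The upside is that the grouping updates itself if the data change; the downside is that the kept categories are no longer visible in the code.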

Now that our categorical variables are cleaned, we can inspect their variability as we did with the numerical variables, this time using bar plots. We plot the simplified property type, room type, superhost status and cancellation policy to analyze their distributions.

# Simple ggplot code specifying x variable, visualisation type and theme
# using patchwork to plot distribution of different variables

p12 <- ggplot(data = cleaning, aes(x = prop_type_simplified)) +
  geom_bar() +
  theme_bw() +
  labs(title = "Property Type (Simplified)", x = "", y = "")

p13 <- ggplot(data = cleaning, aes(x = room_type)) +
  geom_bar() +
  theme_bw() +
  labs(title = "Room Type", x = "", y = "")

p14 <- ggplot(data = cleaning, aes(x = host_is_superhost)) +
  geom_bar() +
  theme_bw() +
  labs(title = "Superhost", x = "", y = "")

p15 <- ggplot(data = cleaning, aes(x = cancellation_policy)) +
  geom_bar() +
  theme_bw() +
  labs(title = "Cancellation Policy", x = "", y = "")

# Using patchwork to create a clean grid of the bar plots

p12 + p13 + p14 + p15 +
  plot_annotation(title = "Apartments are the most common listing in Beijing", 
                  subtitle = "Over half of listings have a flexible cancellation policy,\nand 2/3rds list the entire property")

What does this show us?

  1. The information provided by these bar plots is mostly as expected. In a highly populated city such as Beijing, it is unsurprising that apartments are the most common property type. What is surprising, however, is the high level of variety within ‘Other’, which lets customers filter through many unique types of properties.
  2. Combined with the high number of one-bedroom listings we uncovered previously, it follows that most listings rent out the entire home/apartment rather than just a private room.
  3. Flexible/moderate cancellation policies are also expected to be popular: with so many substitutable listings in the city, Airbnb hosts will likely avoid a strict cancellation policy.
  4. The relative rarity of superhost-listed properties is, again, expected: as Beijing’s Airbnb market is quite new, many hosts haven’t yet had the time to become superhosts. However, this rarity may mean that superhost properties are able to charge a higher price, so superhost status may be an important variable to include in our model.

4.3.3 Preliminary Correlation Analysis

#Here we can explore the correlation between our numerical variables

data_numerical <- data_cleaned %>%

  #First we select the variables we want to plot against each other

  select(c(price, 
           cleaning_fee, 
           guests_included, 
           extra_people, 
           number_of_reviews, 
           review_scores_rating, 
           minimum_nights,
           accommodates, 
           beds, 
           bathrooms, 
           bedrooms))

# data_numerical

#Next we use a corrplot to visualise the correlations between variables

M <- cor(data_numerical, use = "pairwise.complete.obs")
col<- colorRampPalette(c("blue", "white", "purple"))(7)
corrplot(M, method = "color", col = col,
         type = "upper", order = "hclust",
         addCoef.col = "black",
         tl.col="black", tl.srt=45,
         number.cex = 0.7,
         tl.cex = 0.7,
         diag=FALSE
         )

Notable correlations with price are:

  1. Accommodates (number of people the listing can accommodate)
  2. Bedrooms (number of bedrooms at the listing)
  3. Bathrooms (number of bathrooms at the listing)
  4. Beds (number of beds at the listing)
  5. Cleaning fee (additional flat cleaning fee)
  6. Guests included (number of guests included in the price and exempt from extra_people fee)
  7. Extra People (charge per night for each person over the guests_included)
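
To rank correlations like these rather than read them off the plot, one could pull the price row out of the correlation matrix. The following is a minimal sketch on simulated data: the toy data frame and all of its values are invented for illustration, and only the column names mirror ours.

```r
# Toy data: invented values, column names borrowed from our dataset
set.seed(42)
n <- 200
accommodates <- sample(1:8, n, replace = TRUE)
toy <- data.frame(
  accommodates = accommodates,
  bedrooms = pmax(1, round(accommodates / 2 + rnorm(n, sd = 0.5))),
  number_of_reviews = rpois(n, 20),
  price = 200 + 80 * accommodates + rnorm(n, sd = 100)
)

# Same call as in the report: pairwise complete observations
M_toy <- cor(toy, use = "pairwise.complete.obs")

# Correlation of every other variable with price, strongest first
price_cor <- M_toy["price", colnames(M_toy) != "price"]
price_cor <- price_cor[order(abs(price_cor), decreasing = TRUE)]
print(round(price_cor, 2))
```

Sorting by absolute value keeps strong negative correlations near the top as well, which a plain descending sort would push to the bottom.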

4.4 Mapping

As we are looking at data over a geographical region, it can be helpful to see the geospatial spread of the Airbnb listings. Here we use the leaflet package to map our longitude and latitude data onto a map. Note that the coloring of the bubbles is done according to listing density.

# Using the leaflet package

leaflet(data = filter(cleaning, minimum_nights <= 4)) %>% 
  
# Adding the map to lie beneath the data points
  
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  
# Adding our listing data as points on the map
  
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude,

# Adding a function, so that when you click on  a data point, 
#the Airbnb URL for the listing appears
                   
                   popup = ~listing_url,

# Adding a label function, so when you hover over a data point, 
#the property type shows

                   label = ~property_type,

# Due to the high number of markers on the map, we add a cluster 
#option so that it is easier to interpret

                   clusterOptions = markerClusterOptions())
# We can freeze the clustering with "freezeAtZoom" in the markerClusterOptions, 
#but we want this map to be dynamic and allow zooming in to individual listings

5 Regression Analysis

5.1 Preparation for Regression Analysis

In order to run a regression model, we will transform our price data into an approximately ‘normal’ distribution.

# Before transforming anything, let's first see how the distribution of 
#minimum_nights looks (capped at 50 nights so the plot stays readable)

cleaning %>% 
  filter(minimum_nights <=50) %>% 
  ggplot() +
  geom_histogram(aes(x = minimum_nights))

As we are looking to model the price of an Airbnb in Beijing for travel/tourism, we should look into the minimum_nights variable. This variable states the minimum number of nights you are able to book the listing for.

# Visualise the frequency of minimum nights

# arranging listings by minimum_nights
cleaning %>% 
  count(minimum_nights) %>% 
  
# Arrange in descending order of frequency
  
  arrange(desc(n))
## # A tibble: 66 x 2
##    minimum_nights     n
##             <dbl> <int>
##  1              1 30216
##  2              2  2178
##  3              3  1024
##  4             30   819
##  5              7   369
##  6              5   368
##  7             15   316
##  8             90   175
##  9             10   161
## 10             60    89
## # … with 56 more rows
# calculating summary statistics for the distribution of minimum_nights
favstats(data = cleaning , ~ minimum_nights) %>% 
  kbl() %>% 
  kable_styling()
min  Q1  median  Q3   max  mean    sd      n  missing
  1   1       1   1  1086  4.31  28.3  36283        0

From the above, we can infer the following:

  • The most common values for ‘minimum nights’ are 1 to 3 nights, which account for 92.1% of total listings. The next biggest category is ‘30 minimum nights’ (2.26% of total listings).
  • A 30-night minimum seems rather strange - maybe the people booking these Airbnbs are visiting Beijing for reasons other than leisure/travel. For example, they may prefer Airbnbs as a budget friendly alternative to hotels for longer stays intended for business-related work, etc.
  • There are also 61 listings with a minimum stay of 365 nights (1 year), which implies that some Airbnbs are intended for long-term renting or sub-letting.
  • However, given how infrequent these long-term stays are, we conclude that the vast majority of the market is for tourism and leisure related stays.
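
The percentages quoted here can be reproduced from the frequency table and the favstats() output above. A small sketch, with the counts typed in by hand from those outputs (the variable names are ours, chosen for illustration):

```r
# Counts taken from the frequency table above
counts <- c("1" = 30216, "2" = 2178, "3" = 1024, "30" = 819)
total_listings <- 36283  # n reported by favstats()

share_1_to_3 <- 100 * sum(counts[c("1", "2", "3")]) / total_listings
share_30     <- 100 * counts[["30"]] / total_listings

round(share_1_to_3, 1)  # 92.1
round(share_30, 2)      # 2.26
```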

5.2 Creating Variable to Model

neighbourhoodring <- vroom::vroom("neighbourhoodring.csv")

regression_data <-  cleaning %>% 
  
  # filter for minimum nights at most 4
  filter(minimum_nights<=4) %>% 
  
  left_join(., neighbourhoodring, by = "neighbourhood", copy = TRUE) %>%
  
  # New variable that computes the price of 2 people 
  #booking an Airbnb for 4 nights
  # Note: extra_people charge per 1 extra person applied 
  #per night when no. of guests > guests_included
  
  mutate(price_for_4_notlog = case_when(
                          guests_included < 2 ~ cleaning_fee + (4 * (price + extra_people)),
                          TRUE ~ cleaning_fee + (4 * price)
                                        ),
        price_4_nights = log(price_for_4_notlog + 0.9),
    
#New variable that classifies neighborhood into 5 areas according 
#to Beijing's geographical characteristic

#The 5 areas are Ring 2-6  
        neighbourhood_simplified = case_when(
              Ring == "2" ~ "Ring 2",
              Ring == "3" ~ "Ring 3",
              Ring == "4" ~ "Ring 4",
              Ring == "5" ~ "Ring 5",
              TRUE ~ "Ring 6"
              )
  ) %>% 
  
  subset(., select = -Ring)
  
  regression_data
## # A tibble: 33,497 x 35
##    price cleaning_fee extra_people room_type property_type number_of_revie…
##    <dbl>        <dbl>        <dbl> <chr>     <chr>                    <dbl>
##  1   835           71           71 Entire h… Serviced apa…               99
##  2  1203            0            0 Private … Guest suite                  2
##  3   602            0            0 Entire h… Apartment                   66
##  4   602           30            0 Entire h… Apartment                   10
##  5   411           71          106 Entire h… House                      290
##  6   552            0            0 Entire h… Apartment                   26
##  7   601            0            0 Entire h… Apartment                   39
##  8   403            0           64 Entire h… Apartment                   30
##  9   743          283            0 Entire h… Apartment                  117
## 10   418           35           80 Entire h… Apartment                    3
## # … with 33,487 more rows, and 29 more variables: review_scores_rating <dbl>,
## #   longitude <dbl>, latitude <dbl>, neighbourhood <chr>, minimum_nights <dbl>,
## #   guests_included <dbl>, bathrooms <dbl>, bedrooms <dbl>, beds <dbl>,
## #   accommodates <dbl>, host_is_superhost <lgl>, neighbourhood_cleansed <chr>,
## #   cancellation_policy <chr>, listing_url <chr>, is_location_exact <lgl>,
## #   security_deposit <dbl>, review_scores_cleanliness <dbl>,
## #   instant_bookable <lgl>, amenities <chr>,
## #   calculated_host_listings_count <dbl>, reviews_per_month <dbl>,
## #   host_acceptance_rate <dbl>, wifi <lgl>, breakfast <lgl>, services <int>,
## #   prop_type_simplified <chr>, price_for_4_notlog <dbl>, price_4_nights <dbl>,
## #   neighbourhood_simplified <chr>
# ggplot for price of four nights
ggplot(data = regression_data, aes(x = price_for_4_notlog)) +
  geom_histogram() +
  xlim(0, 40000) +
  labs(
    title = "Distribution of Price for 4 Nights",
    x = "Price for 4 Nights",
    y = "Count"
  ) +

# ggplot for log of price of four nights
ggplot(data = regression_data, aes(x = price_4_nights)) +
  geom_density() +
  labs(
    title = "Density of the Logged Price for 4 Nights",
    x = "Log(Price for 4 Nights)",
    y = "Density"
  ) 

We apply a log transformation so that the model coefficients are read as percentage changes in price rather than unit changes.

Why Does One Log Price?

As you can see from the Distribution of Price for 4 Nights, the unlogged variable price_for_4_notlog is heavily right skewed. For a linear regression to fit well, we prefer a response that is closer to a normal distribution. To achieve this, we take the log of the price, giving price_4_nights, as visible from the Density of the Logged Price for 4 Nights.
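
To make the two variables compared here concrete, the sketch below recomputes the raw 4-night price and its log for three hypothetical listings. The listing values are invented; the formula and the 0.9 offset follow the mutate() call in section 5.2.

```r
# Hypothetical listings (values invented for illustration)
listings <- data.frame(
  price           = c(400, 600, 1200),
  cleaning_fee    = c(50, 0, 100),
  extra_people    = c(80, 0, 50),
  guests_included = c(1, 2, 4)
)

# Cost of 4 nights for 2 guests: the extra_people charge applies per night
# only when fewer than 2 guests are included in the base price
price_for_4 <- with(listings, ifelse(
  guests_included < 2,
  cleaning_fee + 4 * (price + extra_people),
  cleaning_fee + 4 * price
))

price_for_4                       # 1970 2400 4900
price_4_log <- log(price_for_4 + 0.9)  # offset 0.9 mirrors the report's code
price_4_log
```

Only the first listing pays the extra_people charge, because it is the only one whose base price covers fewer than 2 guests.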

5.2.1 Building Linear Regression Models

# model 1 with a few variables - reviews and property types
model1 <- lm(price_4_nights ~ 
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating,
             regression_data)

model1 %>%
  tidy(conf.int=TRUE) 
## # A tibble: 7 x 7
##   term                 estimate std.error statistic   p.value conf.low conf.high
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)           6.91     0.0472      146.   0.         6.81      7.00   
## 2 prop_type_simplifie… -0.0609   0.0160       -3.80 1.45e-  4 -0.0923   -0.0295 
## 3 prop_type_simplifie…  0.202    0.0179       11.3  1.78e- 29  0.167     0.237  
## 4 prop_type_simplifie…  0.106    0.0194        5.46 4.95e-  8  0.0679    0.144  
## 5 prop_type_simplifie…  0.453    0.0144       31.5  1.09e-212  0.425     0.482  
## 6 number_of_reviews    -0.00207  0.000259     -8.00 1.32e- 15 -0.00258  -0.00156
## 7 review_scores_rating  0.00453  0.000493      9.18 4.94e- 20  0.00356   0.00550
model1 %>% 
  glance() %>% 
  kbl() %>% 
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.071 0.07 0.74 236 0 6 -20848 41713 41775 10218 18636 18643

Here, property type is a categorical variable - it has five categories and therefore makes up 4 dummy variables in the regression model, with ‘Apartment’ as the baseline. For example, the intercept term for ‘Apartment’ would just be ~ 6.91. For ‘House’, prop_type_simplifiedHouse = 1 (and the Condominium, Loft and Other dummies = 0) and the intercept term would be 6.91 + 0.20 ~ 7.11. For ‘Other’, prop_type_simplifiedOther = 1 (and the Condominium, House and Loft dummies = 0) and the intercept term would be 6.91 + 0.45 ~ 7.36. Therefore, relative to apartments, price_4_nights will be higher for houses, lofts and ‘Other’ properties but lower for condominiums.
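
The dummy coding described here can be inspected directly with model.matrix(), which builds the same design matrix lm() uses. This is a sketch on a toy factor, assuming ‘Apartment’ is the baseline level and using the rounded model1 estimates from the output above:

```r
# Toy factor with the five simplified property types
prop_type_simplified <- factor(
  c("Apartment", "Condominium", "House", "Loft", "Other"),
  levels = c("Apartment", "Condominium", "House", "Loft", "Other")
)

# An intercept column plus one dummy per non-baseline level
X <- model.matrix(~ prop_type_simplified)
colnames(X)
dim(X)  # 5 rows, 5 columns (intercept + 4 dummies)

# Implied intercept per property type, using rounded model1 estimates
coefs <- c(6.91, -0.0609, 0.202, 0.106, 0.453)
implied <- drop(X %*% coefs)
names(implied) <- levels(prop_type_simplified)
round(implied, 2)
```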

Note: our Y variable is in log, so each coefficient, multiplied by 100, approximates the percentage change in the 4-night price per unit change in the corresponding X variable (the exact change is exp(coefficient) - 1).
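
As a rough illustration of how coefficients on a logged outcome translate into percentage changes, the sketch below compares the quick reading (100 times the coefficient) with the exact one for three rounded model1 estimates. The vector names are ours, chosen for readability.

```r
# Rounded estimates copied from the model1 output above
b <- c(
  review_scores_rating = 0.00453,
  house                = 0.202,
  other                = 0.453
)

approx_pct <- 100 * b             # quick reading: 100 * coefficient
exact_pct  <- 100 * (exp(b) - 1)  # exact change implied by a log outcome

round(cbind(approx_pct, exact_pct), 2)
```

The two readings agree closely for small coefficients and drift apart as the coefficient grows, so the shortcut is safest for effects of a few percent.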

Other variables such as number_of_reviews and review_scores_rating are statistically significant. A point worth noting is that a higher number_of_reviews is associated with a lower cost for 4 nights, possibly because the reviews are not necessarily good reviews. On the other hand, review_scores_rating has a positive effect on price_4_nights, which means that properties with a higher score/rating tend to be pricier.

# model 2 = model 1 + room type
model2 <- lm(price_4_nights ~ 
              prop_type_simplified +
               number_of_reviews +
               review_scores_rating +
               room_type, 
             regression_data)

model2 %>% 
  tidy(conf.int=TRUE)  
## # A tibble: 9 x 7
##   term                  estimate std.error statistic  p.value conf.low conf.high
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)            7.12     0.0411      173.   0.        7.04     7.20    
## 2 prop_type_simplified… -0.0337   0.0139       -2.42 1.56e- 2 -0.0610  -0.00637 
## 3 prop_type_simplified…  0.275    0.0156       17.7  3.32e-69  0.245    0.306   
## 4 prop_type_simplified… -0.0265   0.0170       -1.56 1.18e- 1 -0.0598   0.00673 
## 5 prop_type_simplified…  0.528    0.0126       42.0  0.        0.503    0.552   
## 6 number_of_reviews     -0.00140  0.000225     -6.20 5.71e-10 -0.00184 -0.000954
## 7 review_scores_rating   0.00485  0.000429     11.3  1.64e-29  0.00401  0.00569 
## 8 room_typePrivate room -0.668    0.0105      -63.7  0.       -0.689   -0.648   
## 9 room_typeShared room  -1.21     0.0224      -54.1  0.       -1.25    -1.17
model2 %>% 
  glance() %>% 
  kbl() %>% 
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.299 0.298 0.643 992 0 8 -18222 36463 36542 7709 18634 18643

From the above table, we know that room_type has a very significant impact on price_4_nights as adjusted R-squared for model 2 is more than 4 times the adjusted R-squared for model 1. Room type is also a categorical variable with 3 categories, and hence makes up 2 dummy variables in the regression model.

We notice that the t-stat values for several variables which were already present in model 1 have changed in model 2 (the Loft coefficient even changes sign), indicating that there may be some multicollinearity between the variables. To check if that’s the case, we’ll calculate VIF.

vif(model2)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.04  4            1.01
## number_of_reviews    1.01  1            1.01
## review_scores_rating 1.01  1            1.00
## room_type            1.04  2            1.01

None of the variables display any sign of multicollinearity.

5.2.2 Comparing model1 and model2

# creating a huxtable for summary of two models
huxreg(model1, model2,
       statistics = c('#observations' = 'nobs', 
                      'R squared' = 'r.squared', 
                      'Adj. R Squared' = 'adj.r.squared', 
                      'Residual SE' = 'sigma'), 
       bold_signif = 0.05, 
       stars = NULL
) %>% 
  set_caption('Comparison of Models 1.0')
Comparison of Models 1.0
(1)(2)
(Intercept)6.906 7.116 
(0.047)(0.041)
prop_type_simplifiedCondominium-0.061 -0.034 
(0.016)(0.014)
prop_type_simplifiedHouse0.202 0.275 
(0.018)(0.016)
prop_type_simplifiedLoft0.106 -0.027 
(0.019)(0.017)
prop_type_simplifiedOther0.453 0.528 
(0.014)(0.013)
number_of_reviews-0.002 -0.001 
(0.000)(0.000)
review_scores_rating0.005 0.005 
(0.000)(0.000)
room_typePrivate room     -0.668 
     (0.010)
room_typeShared room     -1.210 
     (0.022)
#observations18643     18643     
R squared0.071 0.299 
Adj. R Squared0.070 0.298 
Residual SE0.740 0.643 

5.2.3 Exploring more variables

Previously, we plotted a correlation matrix to see which variables can be added to our regression model.

# model 3 = model 2 + beds, baths, bedrooms and no. of guests property can accommodate
model3 <- lm(price_4_nights ~ 
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating + 
               room_type +
               bedrooms +
               bathrooms +
               beds +
               accommodates, 
             regression_data
            )

model3 %>%
  tidy(conf.int=TRUE)
## # A tibble: 13 x 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)         6.72      0.0351      191.   0.         6.65e+0   6.79e+0
##  2 prop_type_simplif… -0.0376    0.0118       -3.20 1.38e-  3 -6.06e-2  -1.46e-2
##  3 prop_type_simplif…  0.120     0.0133        9.08 1.23e- 19  9.45e-2   1.47e-1
##  4 prop_type_simplif… -0.0626    0.0143       -4.36 1.29e-  5 -9.07e-2  -3.45e-2
##  5 prop_type_simplif…  0.261     0.0111       23.6  2.97e-121  2.39e-1   2.83e-1
##  6 number_of_reviews  -0.000394  0.000190     -2.07 3.84e-  2 -7.66e-4  -2.10e-5
##  7 review_scores_rat…  0.00331   0.000364      9.10 1.03e- 19  2.60e-3   4.03e-3
##  8 room_typePrivate … -0.410     0.00946     -43.3  0.        -4.28e-1  -3.91e-1
##  9 room_typeShared r… -0.914     0.0197      -46.4  0.        -9.52e-1  -8.75e-1
## 10 bedrooms            0.0756    0.00684      11.0  2.89e- 28  6.22e-2   8.90e-2
## 11 bathrooms           0.0294    0.00405       7.27 3.66e- 13  2.15e-2   3.74e-2
## 12 beds               -0.0330    0.00319     -10.3  5.30e- 25 -3.92e-2  -2.67e-2
## 13 accommodates        0.117     0.00290      40.3  0.         1.11e-1   1.22e-1
model3 %>%
  glance() %>% 
  kbl() %>% 
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.503 0.503 0.542 1565 0 12 -14963 29954 30063 5447 18559 18572
# using VIF to check for multicollinearity
vif(model3)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.15  4            1.02
## number_of_reviews    1.02  1            1.01
## review_scores_rating 1.01  1            1.01
## room_type            1.26  2            1.06
## bedrooms             4.39  1            2.10
## bathrooms            1.62  1            1.27
## beds                 3.12  1            1.77
## accommodates         4.42  1            2.10

In the table above, we can see that the VIF for bedrooms, beds and accommodates is higher than for the other variables. It is not a problem as such, since each VIF is still less than 5. A higher VIF is expected here because the more beds and bedrooms a property has, the more guests it can accommodate, so there is some correlation between these variables.
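
For intuition about what a VIF measures, the sketch below computes it by hand as 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The data are simulated (all values invented) so that beds and accommodates both track bedrooms; this illustrates the definition and is not a re-run of our model.

```r
set.seed(1)
n <- 500
# Simulated predictors (invented): beds and accommodates both track bedrooms
bedrooms     <- rpois(n, 1) + 1
beds         <- bedrooms + rpois(n, 1)
accommodates <- 2 * bedrooms + rpois(n, 1)
bathrooms    <- rpois(n, 1) + 1
y <- 6 + 0.1 * accommodates + rnorm(n)

fit <- lm(y ~ bedrooms + beds + accommodates + bathrooms)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
X <- model.matrix(fit)[, -1]
vif_manual <- sapply(colnames(X), function(j) {
  r2 <- summary(lm(X[, j] ~ X[, colnames(X) != j]))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

In this simulation bathrooms is generated independently, so its VIF should sit close to 1, while the three size-related predictors inflate one another.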

Does price of a property vary significantly if host is a Superhost?

Superhosts are experienced hosts who are most dedicated to providing outstanding hospitality to their guests. They need to maintain certain standards in response rate, cancellation rate and overall rating to earn this badge. From that perspective, we hypothesize that other factors remaining constant, a Superhost will charge prices higher than the average host. Let’s see if that’s true.

# model5 = model 3 + superhost status
model5 <- lm(price_4_nights ~ 
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating + 
               room_type +
               bedrooms +
               bathrooms +
               beds +
               accommodates +
               host_is_superhost, 
             regression_data
            )

model5 %>%
  tidy(conf.int=TRUE)
## # A tibble: 14 x 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)          6.75e+0  0.0353      191.   0.         6.68     6.82    
##  2 prop_type_simplifi… -3.95e-2  0.0117       -3.36 7.69e-  4 -0.0625  -0.0165  
##  3 prop_type_simplifi…  1.22e-1  0.0133        9.20 3.79e- 20  0.0961   0.148   
##  4 prop_type_simplifi… -6.55e-2  0.0143       -4.57 4.92e-  6 -0.0936  -0.0374  
##  5 prop_type_simplifi…  2.63e-1  0.0111       23.8  6.65e-123  0.241    0.284   
##  6 number_of_reviews   -7.36e-4  0.000196     -3.76 1.70e-  4 -0.00112 -0.000352
##  7 review_scores_rati…  2.79e-3  0.000371      7.53 5.42e- 14  0.00206  0.00352 
##  8 room_typePrivate r… -4.11e-1  0.00944     -43.5  0.        -0.429   -0.392   
##  9 room_typeShared ro… -9.09e-1  0.0197      -46.2  0.        -0.948   -0.871   
## 10 bedrooms             7.72e-2  0.00684      11.3  2.00e- 29  0.0638   0.0906  
## 11 bathrooms            2.90e-2  0.00404       7.17 7.73e- 13  0.0211   0.0369  
## 12 beds                -3.29e-2  0.00318     -10.3  6.59e- 25 -0.0391  -0.0266  
## 13 accommodates         1.16e-1  0.00289      40.2  0.         0.111    0.122   
## 14 host_is_superhostT…  6.24e-2  0.00866       7.21 6.01e- 13  0.0454   0.0794
model5 %>%
  glance() %>% 
  kbl() %>% 
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.504 0.504 0.541 1453 0 13 -14936 29902 30019 5432 18557 18571

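Our hypothesis seems to be true; host_is_superhost is significant as per its t-stat and p-value. Since the Y variable is in log, the coefficient of 0.0624 means one can expect the price for a Superhost’s property to be higher than an average host’s property by roughly 6.2% (exp(0.0624) - 1 ≈ 6.4% exactly), other factors remaining constant.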

Is Location Exact?

Some hosts specify the exact location of their property; let’s see if that has any effect on the price for 4 nights.

model6 <- lm(price_4_nights ~ 
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating + 
               room_type +
               bedrooms +
               bathrooms +
               beds +
               accommodates +
               host_is_superhost +
               is_location_exact, 
             regression_data
            )

model6 %>%
  tidy(conf.int=TRUE)
## # A tibble: 15 x 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)          6.81e+0  0.0358      190.   0.         6.74     6.88    
##  2 prop_type_simplifi… -4.01e-2  0.0117       -3.43 6.15e-  4 -0.0631  -0.0172  
##  3 prop_type_simplifi…  1.09e-1  0.0133        8.23 1.92e- 16  0.0834   0.136   
##  4 prop_type_simplifi… -6.36e-2  0.0143       -4.45 8.76e-  6 -0.0916  -0.0356  
##  5 prop_type_simplifi…  2.49e-1  0.0111       22.4  3.28e-109  0.227    0.271   
##  6 number_of_reviews   -9.25e-4  0.000196     -4.71 2.49e-  6 -0.00131 -0.000540
##  7 review_scores_rati…  2.75e-3  0.000370      7.44 1.03e- 13  0.00203  0.00348 
##  8 room_typePrivate r… -4.18e-1  0.00945     -44.2  0.        -0.436   -0.399   
##  9 room_typeShared ro… -9.13e-1  0.0196      -46.5  0.        -0.951   -0.874   
## 10 bedrooms             7.55e-2  0.00683      11.1  2.20e- 28  0.0622   0.0889  
## 11 bathrooms            2.78e-2  0.00404       6.89 5.70e- 12  0.0199   0.0357  
## 12 beds                -3.22e-2  0.00318     -10.1  4.20e- 24 -0.0384  -0.0260  
## 13 accommodates         1.15e-1  0.00289      39.9  0.         0.110    0.121   
## 14 host_is_superhostT…  6.76e-2  0.00866       7.81 6.13e- 15  0.0506   0.0846  
## 15 is_location_exactT… -7.74e-2  0.00825      -9.39 6.99e- 21 -0.0936  -0.0612
model6 %>%
  glance() %>% 
  kbl() %>% 
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.507 0.506 0.54 1362 0 14 -14892 29816 29941 5406 18556 18571

Well, the variable is_location_exact seems to be significant as per its t-stat and p-value; however, the negative coefficient is surprising. Maybe that has something to do not with whether the location specified is exact, but with what the location is!

For this purpose, let us include neighbourhood location into our regression model. To make things simple, we created a new variable called neighbourhood_simplified which groups different listings into broader categories or rings.

# Adding neighbourhood location 
model7 <- lm(price_4_nights ~ 
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating + 
               room_type +
               bedrooms +
               bathrooms +
               beds +
               accommodates +
               host_is_superhost + 
               is_location_exact +
               neighbourhood_simplified,
              regression_data
              )

model7 %>%
  tidy(conf.int=TRUE)
## # A tibble: 19 x 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)          6.95     0.0364      191.   0.         6.88      7.02   
##  2 prop_type_simplifi… -0.0350   0.0115       -3.03 2.44e-  3 -0.0576   -0.0124 
##  3 prop_type_simplifi…  0.0974   0.0132        7.38 1.60e- 13  0.0715    0.123  
##  4 prop_type_simplifi… -0.0221   0.0146       -1.52 1.28e-  1 -0.0507    0.00638
##  5 prop_type_simplifi…  0.261    0.0114       23.0  5.70e-115  0.239     0.283  
##  6 number_of_reviews   -0.00173  0.000197     -8.80 1.53e- 18 -0.00211  -0.00134
##  7 review_scores_rati…  0.00316  0.000365      8.64 6.00e- 18  0.00244   0.00387
##  8 room_typePrivate r… -0.415    0.00935     -44.4  0.        -0.434    -0.397  
##  9 room_typeShared ro… -0.928    0.0195      -47.7  0.        -0.966    -0.890  
## 10 bedrooms             0.0922   0.00676      13.6  4.31e- 42  0.0789    0.105  
## 11 bathrooms            0.0347   0.00399       8.72 3.11e- 18  0.0269    0.0426 
## 12 beds                -0.0329   0.00313     -10.5  8.66e- 26 -0.0390   -0.0268 
## 13 accommodates         0.111    0.00286      38.8  4.53e-316  0.105     0.116  
## 14 host_is_superhostT…  0.0635   0.00853       7.44 1.04e- 13  0.0468    0.0802 
## 15 is_location_exactT… -0.0770   0.00816      -9.44 4.20e- 21 -0.0930   -0.0610 
## 16 neighbourhood_simp… -0.198    0.0138      -14.3  2.41e- 46 -0.225    -0.171  
## 17 neighbourhood_simp… -0.183    0.0124      -14.8  4.14e- 49 -0.207    -0.159  
## 18 neighbourhood_simp… -0.205    0.0354       -5.81 6.52e-  9 -0.275    -0.136  
## 19 neighbourhood_simp… -0.294    0.0123      -23.9  1.27e-124 -0.318    -0.270
model7 %>%
  glance() %>% 
  kbl () %>% 
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.522 0.521 0.532 1124 0 18 -14609 29257 29414 5243 18552 18571

neighbourhood_simplified is a categorical variable with 5 categories, corresponding to 5 concentric rings - Ring 2, Ring 3, Ring 4, Ring 5 and Ring 6. It enters the model as 4 dummy variables, with Ring 2 as the baseline. Rings are similar to Zones in London, so Ring 2 is a more central location compared to Rings 3, 4, 5 or 6. We hypothesize that the more central the location of the property, the higher the price will be.

According to the coefficients of neighbourhood_simplifiedRing # above, our hypothesis is broadly supported. For example, in Ring 2 the intercept term is 6.95. The negative sign in coefficients of Ring 3, 4, 5 and 6 indicates that the intercept term will be lower by 0.20, 0.18, 0.21 and 0.29 respectively. So, properties outside central Beijing have a lower price_4_nights, with the lowest prices in Ring 6.

With inclusion of these location variables, our adjusted R-squared has increased to 0.521. Let’s continue to improve our model further. From the perspective of a host who is setting prices in accordance with the time, money and effort they spend in managing the property, and from the perspective of a traveler who is booking the Airbnb and paying that price, some other variables worth considering are:

  1. cancellation policy
  2. review scores specifically for cleanliness
  3. security deposit amount
  4. whether the property is instant bookable
  5. amenities like wifi and breakfast
model8 <- lm(price_4_nights ~ 
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating + 
               room_type +
               bedrooms +
               bathrooms +
               beds +
               accommodates +
               host_is_superhost +
               is_location_exact +
               neighbourhood_simplified +
               cancellation_policy,
             regression_data
            )

model8 %>% 
  tidy(conf.int=TRUE)
## # A tibble: 21 x 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)          6.92     0.0366      189.   0.         6.85      6.99   
##  2 prop_type_simplifi… -0.0382   0.0115       -3.32 9.12e-  4 -0.0608   -0.0156 
##  3 prop_type_simplifi…  0.0964   0.0132        7.32 2.65e- 13  0.0705    0.122  
##  4 prop_type_simplifi… -0.0243   0.0145       -1.67 9.43e-  2 -0.0528    0.00417
##  5 prop_type_simplifi…  0.262    0.0114       23.1  4.17e-116  0.240     0.284  
##  6 number_of_reviews   -0.00188  0.000197     -9.54 1.58e- 21 -0.00227  -0.00150
##  7 review_scores_rati…  0.00312  0.000365      8.56 1.20e- 17  0.00241   0.00384
##  8 room_typePrivate r… -0.414    0.00934     -44.4  0.        -0.432    -0.396  
##  9 room_typeShared ro… -0.927    0.0195      -47.6  0.        -0.965    -0.888  
## 10 bedrooms             0.0928   0.00676      13.7  1.03e- 42  0.0795    0.106  
## # … with 11 more rows
model8 %>% 
  glance() %>% 
  kbl() %>% 
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.523 0.523 0.531 1017 0 20 -14579 29203 29375 5227 18550 18571
model9 <- lm(price_4_nights ~ 
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating + 
               room_type +
               bedrooms +
               bathrooms +
               beds +
               accommodates +
               host_is_superhost +
               is_location_exact +
               neighbourhood_simplified +
               review_scores_cleanliness, 
             regression_data
             )

model9 %>%
  tidy(conf.int=TRUE)
## # A tibble: 20 x 7
##    term               estimate std.error statistic   p.value  conf.low conf.high
##    <chr>                 <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)         6.90     0.0388      178.   0.          6.82e+0   6.97   
##  2 prop_type_simplif… -0.0351   0.0115       -3.04 2.37e-  3  -5.77e-2  -0.0125 
##  3 prop_type_simplif…  0.0968   0.0132        7.34 2.17e- 13   7.10e-2   0.123  
##  4 prop_type_simplif… -0.0227   0.0145       -1.56 1.19e-  1  -5.12e-2   0.00583
##  5 prop_type_simplif…  0.260    0.0114       22.8  6.82e-114   2.38e-1   0.282  
##  6 number_of_reviews  -0.00174  0.000197     -8.86 8.54e- 19  -2.13e-3  -0.00136
##  7 review_scores_rat…  0.00116  0.000587      1.98 4.78e-  2   1.14e-5   0.00231
##  8 room_typePrivate … -0.416    0.00934     -44.5  0.         -4.34e-1  -0.397  
##  9 room_typeShared r… -0.925    0.0195      -47.5  0.         -9.63e-1  -0.887  
## 10 bedrooms            0.0922   0.00676      13.6  4.04e- 42   7.89e-2   0.105  
## 11 bathrooms           0.0348   0.00398       8.73 2.74e- 18   2.70e-2   0.0426 
## 12 beds               -0.0328   0.00313     -10.5  1.07e- 25  -3.90e-2  -0.0267 
## 13 accommodates        0.111    0.00286      38.8  6.86e-317   1.05e-1   0.116  
## 14 host_is_superhost…  0.0606   0.00855       7.09 1.38e- 12   4.39e-2   0.0774 
## 15 is_location_exact… -0.0774   0.00815      -9.49 2.60e- 21  -9.33e-2  -0.0614 
## 16 neighbourhood_sim… -0.199    0.0138      -14.4  9.13e- 47  -2.26e-1  -0.172  
## 17 neighbourhood_sim… -0.183    0.0124      -14.8  1.98e- 49  -2.08e-1  -0.159  
## 18 neighbourhood_sim… -0.206    0.0353       -5.82 5.95e-  9  -2.75e-1  -0.136  
## 19 neighbourhood_sim… -0.296    0.0123      -24.1  2.15e-126  -3.21e-1  -0.272  
## 20 review_scores_cle…  0.0259   0.00600       4.31 1.66e-  5   1.41e-2   0.0376
model9 %>%
  glance() %>% 
  kbl() %>% 
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.522 0.521 0.531 1066 0 19 -14597 29236 29400 5237 18548 18568

The cleanliness score is significant, but AIC and BIC are higher (29236 and 29400) than when we use cancellation policy instead (29203 and 29375), so cancellation policy gives the better fit.
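
A comparison like this one can be tabulated with AIC() and BIC(). The sketch below uses the built-in mtcars data purely as a stand-in for regression_data, and the model names are hypothetical. One caveat it makes explicit: information criteria are only comparable when the models are fitted to the same rows, whereas model8 and model9 above report 18571 and 18568 observations respectively.

```r
# Built-in mtcars stands in for regression_data (illustration only)
base_fit <- lm(log(mpg) ~ wt + hp, data = mtcars)
alt_a    <- update(base_fit, . ~ . + qsec)
alt_b    <- update(base_fit, . ~ . + drat)

# AIC/BIC are only comparable when the models use the same observations
stopifnot(nobs(alt_a) == nobs(alt_b))

comparison <- data.frame(
  model = c("base + qsec", "base + drat"),
  AIC   = c(AIC(alt_a), AIC(alt_b)),
  BIC   = c(BIC(alt_a), BIC(alt_b))
)

# Lower values indicate the preferred model
comparison[order(comparison$AIC), ]
```

With real data, dropping rows that have missing values in any candidate variable before fitting keeps the observation counts aligned.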

# Add instant bookable
model10 <- lm(price_4_nights ~ 
                prop_type_simplified +
                number_of_reviews +
                review_scores_rating +
                room_type +
                bedrooms +
                bathrooms +
                beds +
                accommodates +
                host_is_superhost +
                is_location_exact +
                neighbourhood_simplified +
                instant_bookable,
              regression_data
              )

model10 %>%
  tidy(conf.int=TRUE)
## # A tibble: 20 x 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)          6.95     0.0368     189.    0.         6.88      7.02   
##  2 prop_type_simplifi… -0.0350   0.0115      -3.03  2.43e-  3 -0.0576   -0.0124 
##  3 prop_type_simplifi…  0.0974   0.0132       7.38  1.63e- 13  0.0715    0.123  
##  4 prop_type_simplifi… -0.0222   0.0146      -1.52  1.28e-  1 -0.0507    0.00637
##  5 prop_type_simplifi…  0.261    0.0114      22.9   8.24e-115  0.239     0.283  
##  6 number_of_reviews   -0.00173  0.000197    -8.78  1.76e- 18 -0.00211  -0.00134
##  7 review_scores_rati…  0.00316  0.000365     8.64  5.97e- 18  0.00244   0.00387
##  8 room_typePrivate r… -0.415    0.00936    -44.4   0.        -0.434    -0.397  
##  9 room_typeShared ro… -0.928    0.0195     -47.6   0.        -0.966    -0.890  
## 10 bedrooms             0.0922   0.00676     13.6   4.30e- 42  0.0789    0.105  
## 11 bathrooms            0.0347   0.00399      8.71  3.16e- 18  0.0269    0.0426 
## 12 beds                -0.0329   0.00313    -10.5   8.62e- 26 -0.0390   -0.0268 
## 13 accommodates         0.111    0.00286     38.8   5.13e-316  0.105     0.116  
## 14 host_is_superhostT…  0.0634   0.00854      7.43  1.16e- 13  0.0467    0.0802 
## 15 is_location_exactT… -0.0771   0.00817     -9.43  4.75e- 21 -0.0931   -0.0610 
## 16 neighbourhood_simp… -0.198    0.0138     -14.3   2.68e- 46 -0.225    -0.171  
## 17 neighbourhood_simp… -0.183    0.0124     -14.8   4.13e- 49 -0.207    -0.159  
## 18 neighbourhood_simp… -0.205    0.0354      -5.81  6.52e-  9 -0.275    -0.136  
## 19 neighbourhood_simp… -0.294    0.0123     -23.9   4.30e-124 -0.318    -0.270  
## 20 instant_bookableTR…  0.00109  0.00841      0.129 8.97e-  1 -0.0154    0.0176
model10 %>%
  glance() %>% 
  kbl() %>% 
  kable_styling()
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC    BIC    deviance  df.residual  nobs
0.522      0.521          0.532  1064       0        19  -14609  29259  29423  5243      18551        18571

instant_bookable has a t-statistic of 0.129 (p-value 0.897), well below the significance threshold, so it is not significant and adds nothing to the model.
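The significance check described above can be sketched in base R. The example uses mtcars as stand-in data, and the cutoff of |t| > 2 is an assumed rule of thumb (roughly a 5% two-sided test); the exact threshold is not restated in this section.

```r
# Stand-in data: built-in mtcars, not the Airbnb listings.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- as.data.frame(summary(fit)$coefficients)

t_threshold <- 2  # assumed rule-of-thumb cutoff, not taken from the report
coefs$significant <- abs(coefs[["t value"]]) > t_threshold
print(coefs[, c("Estimate", "t value", "Pr(>|t|)", "significant")])
```

Regressors flagged FALSE are candidates for removal, as was done with instant_bookable.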

# using security deposit untransformed (in levels)
model11 <- lm(price_4_nights ~ 
                prop_type_simplified +
                number_of_reviews +
                review_scores_rating +
                room_type +
                bedrooms +
                bathrooms +
                beds +
                accommodates +
                host_is_superhost +
                is_location_exact +
                neighbourhood_simplified +
                security_deposit,
             regression_data
             )

model11 %>%
  tidy(conf.int=TRUE)
## # A tibble: 20 x 7
##    term               estimate  std.error statistic   p.value conf.low conf.high
##    <chr>                 <dbl>      <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)         6.95e+0 0.0363        191.   0.         6.88e+0   7.02e+0
##  2 prop_type_simpli…  -3.60e-2 0.0115         -3.13 1.75e-  3 -5.86e-2  -1.35e-2
##  3 prop_type_simpli…   9.74e-2 0.0132          7.40 1.43e- 13  7.16e-2   1.23e-1
##  4 prop_type_simpli…  -2.14e-2 0.0145         -1.47 1.41e-  1 -4.98e-2   7.11e-3
##  5 prop_type_simpli…   2.60e-1 0.0114         22.9  9.30e-115  2.38e-1   2.83e-1
##  6 number_of_reviews  -1.84e-3 0.000196       -9.34 1.04e- 20 -2.22e-3  -1.45e-3
##  7 review_scores_ra…   3.11e-3 0.000365        8.53 1.52e- 17  2.40e-3   3.83e-3
##  8 room_typePrivate…  -4.12e-1 0.00933       -44.2  0.        -4.31e-1  -3.94e-1
##  9 room_typeShared …  -9.23e-1 0.0194        -47.5  0.        -9.61e-1  -8.85e-1
## 10 bedrooms            9.19e-2 0.00675        13.6  4.86e- 42  7.87e-2   1.05e-1
## 11 bathrooms           3.47e-2 0.00398         8.72 3.03e- 18  2.69e-2   4.25e-2
## 12 beds               -3.25e-2 0.00312       -10.4  2.88e- 25 -3.86e-2  -2.64e-2
## 13 accommodates        1.10e-1 0.00285        38.7  8.38e-315  1.05e-1   1.16e-1
## 14 host_is_superhos…   6.22e-2 0.00851         7.31 2.84e- 13  4.55e-2   7.89e-2
## 15 is_location_exac…  -7.44e-2 0.00814        -9.14 6.72e- 20 -9.04e-2  -5.85e-2
## 16 neighbourhood_si…  -1.99e-1 0.0138        -14.4  6.03e- 47 -2.26e-1  -1.72e-1
## 17 neighbourhood_si…  -1.85e-1 0.0123        -15.0  2.64e- 50 -2.09e-1  -1.60e-1
## 18 neighbourhood_si…  -2.07e-1 0.0353         -5.86 4.84e-  9 -2.76e-1  -1.37e-1
## 19 neighbourhood_si…  -2.92e-1 0.0123        -23.8  8.47e-124 -3.17e-1  -2.68e-1
## 20 security_deposit    2.38e-5 0.00000257      9.24 2.77e- 20  1.87e-5   2.88e-5
model11 %>%
  glance() %>% 
  kbl() %>% 
  kable_styling()
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC    BIC    deviance  df.residual  nobs
0.524      0.523          0.53   1074       0        19  -14566  29174  29338  5219      18551        18571
# using log of security deposit instead as it is a highly skewed variable
model12 <- lm(price_4_nights ~ 
                prop_type_simplified +
                number_of_reviews +
                review_scores_rating + 
                room_type +
                bedrooms +
                bathrooms +
                beds +
                accommodates +
                host_is_superhost +
                is_location_exact +
                neighbourhood_simplified +
                log(security_deposit + 0.001),
             regression_data
             )

model12 %>%
  tidy(conf.int=TRUE)
## # A tibble: 20 x 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)          6.99     0.0364      192.   0.         6.92      7.06   
##  2 prop_type_simplifi… -0.0377   0.0115       -3.28 1.04e-  3 -0.0602   -0.0152 
##  3 prop_type_simplifi…  0.101    0.0131        7.71 1.36e- 14  0.0755    0.127  
##  4 prop_type_simplifi… -0.0247   0.0145       -1.71 8.79e-  2 -0.0531    0.00367
##  5 prop_type_simplifi…  0.265    0.0113       23.4  5.01e-119  0.243     0.287  
##  6 number_of_reviews   -0.00191  0.000196     -9.75 2.14e- 22 -0.00230  -0.00153
##  7 review_scores_rati…  0.00296  0.000364      8.13 4.59e- 16  0.00225   0.00367
##  8 room_typePrivate r… -0.406    0.00934     -43.4  0.        -0.424    -0.388  
##  9 room_typeShared ro… -0.908    0.0195      -46.7  0.        -0.946    -0.870  
## 10 bedrooms             0.0912   0.00674      13.5  1.61e- 41  0.0780    0.104  
## 11 bathrooms            0.0344   0.00397       8.67 4.64e- 18  0.0266    0.0422 
## 12 beds                -0.0323   0.00312     -10.4  4.52e- 25 -0.0384   -0.0262 
## 13 accommodates         0.110    0.00285      38.7  4.48e-315  0.105     0.116  
## 14 host_is_superhostT…  0.0572   0.00851       6.72 1.88e- 11  0.0405    0.0739 
## 15 is_location_exactT… -0.0708   0.00814      -8.70 3.63e- 18 -0.0868   -0.0548 
## 16 neighbourhood_simp… -0.196    0.0137      -14.3  5.61e- 46 -0.223    -0.169  
## 17 neighbourhood_simp… -0.185    0.0123      -15.0  7.79e- 51 -0.210    -0.161  
## 18 neighbourhood_simp… -0.210    0.0352       -5.97 2.48e-  9 -0.279    -0.141  
## 19 neighbourhood_simp… -0.287    0.0123      -23.5  7.17e-120 -0.311    -0.263  
## 20 log(security_depos…  0.00817  0.000666     12.3  1.77e- 34  0.00686   0.00947
model12 %>%
  glance() %>% 
  kbl() %>% 
  kable_styling()
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC    BIC    deviance  df.residual  nobs
0.525      0.525          0.53   1081       0        19  -14533  29109  29273  5201      18551        18571
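The small constant in log(security_deposit + 0.001) keeps the logarithm defined when a deposit is zero. The sketch below illustrates the effect on simulated deposits; the values are invented for illustration, and the presence of zero deposits in the real data is an assumption. log1p() is shown only as a common alternative and is not what the models above use.

```r
# Simulated deposits (illustrative values only): many zeros plus a long right tail.
set.seed(1)
deposit <- c(rep(0, 50), round(rlnorm(50, meanlog = 6, sdlog = 1)))

log(0)                              # -Inf, which lm() cannot use as a regressor value
offset_log <- log(deposit + 0.001)  # the transformation used in model12
log1p_dep  <- log1p(deposit)        # alternative: log(1 + x), maps zero to zero

# With the 0.001 offset, zeros land at log(0.001), about -6.9, far below the
# positive deposits, so the variable partly acts like a "has a deposit" indicator.
summary(offset_log)
summary(log1p_dep)
```

The choice of offset changes how far the zero group sits from the rest, so the coefficient on the transformed variable should be read with that in mind.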
# host acceptance rate
model12.5 <- lm(price_4_nights ~ 
                 prop_type_simplified +
                 number_of_reviews +
                 review_scores_rating +
                 room_type +
                 bedrooms +
                 bathrooms +
                 beds +
                 accommodates +
                 host_is_superhost +
                 is_location_exact +
                 neighbourhood_simplified +
                 host_acceptance_rate,
               regression_data
               )

model12.5 %>%
  tidy(conf.int=TRUE)
## # A tibble: 20 x 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)          6.95     0.0384     181.    0.         6.87      7.02   
##  2 prop_type_simplifi… -0.0350   0.0115      -3.04  2.41e-  3 -0.0576   -0.0124 
##  3 prop_type_simplifi…  0.0975   0.0132       7.39  1.50e- 13  0.0717    0.123  
##  4 prop_type_simplifi… -0.0224   0.0146      -1.54  1.24e-  1 -0.0509    0.00617
##  5 prop_type_simplifi…  0.261    0.0114      22.9   1.08e-114  0.239     0.283  
##  6 number_of_reviews   -0.00173  0.000197    -8.81  1.35e- 18 -0.00212  -0.00135
##  7 review_scores_rati…  0.00316  0.000366     8.65  5.40e- 18  0.00245   0.00388
##  8 room_typePrivate r… -0.415    0.00939    -44.2   0.        -0.433    -0.397  
##  9 room_typeShared ro… -0.927    0.0195     -47.5   0.        -0.966    -0.889  
## 10 bedrooms             0.0923   0.00677     13.6   3.83e- 42  0.0790    0.106  
## 11 bathrooms            0.0347   0.00399      8.71  3.21e- 18  0.0269    0.0426 
## 12 beds                -0.0329   0.00313    -10.5   8.35e- 26 -0.0391   -0.0268 
## 13 accommodates         0.111    0.00286     38.7   2.59e-315  0.105     0.116  
## 14 host_is_superhostT…  0.0625   0.00875      7.14  9.65e- 13  0.0453    0.0796 
## 15 is_location_exactT… -0.0776   0.00825     -9.41  5.75e- 21 -0.0938   -0.0615 
## 16 neighbourhood_simp… -0.198    0.0138     -14.3   2.59e- 46 -0.225    -0.171  
## 17 neighbourhood_simp… -0.183    0.0124     -14.8   4.91e- 49 -0.207    -0.158  
## 18 neighbourhood_simp… -0.205    0.0354      -5.80  6.80e-  9 -0.274    -0.136  
## 19 neighbourhood_simp… -0.294    0.0123     -23.9   1.14e-124 -0.318    -0.270  
## 20 host_acceptance_ra…  0.00742  0.0145       0.513 6.08e-  1 -0.0209    0.0357
model12.5 %>%
  glance() %>% 
  kbl() %>% 
  kable_styling()
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC    BIC    deviance  df.residual  nobs
0.522      0.521          0.532  1064       0        19  -14608  29259  29423  5243      18551        18571
# summary table to compare last few models
huxreg(model8, model9, model10, model11, model12, model12.5,
       statistics = c('#observations' = 'nobs', 
                      'R squared' = 'r.squared', 
                      'Adj. R Squared' = 'adj.r.squared', 
                      'Residual SE' = 'sigma'), 
       bold_signif = 0.05, 
       stars = NULL
) %>% 
  set_caption('Comparison of Models 2.0')
Comparison of Models 2.0
                                                (1)      (2)      (3)      (4)      (5)      (6)
(Intercept)                                     6.923    6.897    6.951    6.950    6.993    6.945
                                                (0.037)  (0.039)  (0.037)  (0.036)  (0.036)  (0.038)
prop_type_simplifiedCondominium                 -0.038   -0.035   -0.035   -0.036   -0.038   -0.035
                                                (0.012)  (0.012)  (0.012)  (0.012)  (0.011)  (0.012)
prop_type_simplifiedHouse                       0.096    0.097    0.097    0.097    0.101    0.098
                                                (0.013)  (0.013)  (0.013)  (0.013)  (0.013)  (0.013)
prop_type_simplifiedLoft                        -0.024   -0.023   -0.022   -0.021   -0.025   -0.022
                                                (0.015)  (0.015)  (0.015)  (0.015)  (0.014)  (0.015)
prop_type_simplifiedOther                       0.262    0.260    0.261    0.260    0.265    0.261
                                                (0.011)  (0.011)  (0.011)  (0.011)  (0.011)  (0.011)
number_of_reviews                               -0.002   -0.002   -0.002   -0.002   -0.002   -0.002
                                                (0.000)  (0.000)  (0.000)  (0.000)  (0.000)  (0.000)
review_scores_rating                            0.003    0.001    0.003    0.003    0.003    0.003
                                                (0.000)  (0.001)  (0.000)  (0.000)  (0.000)  (0.000)
room_typePrivate room                           -0.414   -0.416   -0.415   -0.412   -0.406   -0.415
                                                (0.009)  (0.009)  (0.009)  (0.009)  (0.009)  (0.009)
room_typeShared room                            -0.927   -0.925   -0.928   -0.923   -0.908   -0.927
                                                (0.019)  (0.019)  (0.020)  (0.019)  (0.019)  (0.020)
bedrooms                                        0.093    0.092    0.092    0.092    0.091    0.092
                                                (0.007)  (0.007)  (0.007)  (0.007)  (0.007)  (0.007)
bathrooms                                       0.034    0.035    0.035    0.035    0.034    0.035
                                                (0.004)  (0.004)  (0.004)  (0.004)  (0.004)  (0.004)
beds                                            -0.032   -0.033   -0.033   -0.032   -0.032   -0.033
                                                (0.003)  (0.003)  (0.003)  (0.003)  (0.003)  (0.003)
accommodates                                    0.109    0.111    0.111    0.110    0.110    0.111
                                                (0.003)  (0.003)  (0.003)  (0.003)  (0.003)  (0.003)
host_is_superhostTRUE                           0.055    0.061    0.063    0.062    0.057    0.062
                                                (0.009)  (0.009)  (0.009)  (0.009)  (0.009)  (0.009)
is_location_exactTRUE                           -0.075   -0.077   -0.077   -0.074   -0.071   -0.078
                                                (0.008)  (0.008)  (0.008)  (0.008)  (0.008)  (0.008)
neighbourhood_simplifiedRing 3                  -0.194   -0.199   -0.198   -0.199   -0.196   -0.198
                                                (0.014)  (0.014)  (0.014)  (0.014)  (0.014)  (0.014)
neighbourhood_simplifiedRing 4                  -0.180   -0.183   -0.183   -0.185   -0.185   -0.183
                                                (0.012)  (0.012)  (0.012)  (0.012)  (0.012)  (0.012)
neighbourhood_simplifiedRing 5                  -0.203   -0.206   -0.205   -0.207   -0.210   -0.205
                                                (0.035)  (0.035)  (0.035)  (0.035)  (0.035)  (0.035)
neighbourhood_simplifiedRing 6                  -0.285   -0.296   -0.294   -0.292   -0.287   -0.294
                                                (0.012)  (0.012)  (0.012)  (0.012)  (0.012)  (0.012)
cancellation_policymoderate                     0.055
                                                (0.009)
cancellation_policystrict_14_with_grace_period  0.070
                                                (0.010)
review_scores_cleanliness                                0.026
                                                         (0.006)
instant_bookableTRUE                                              0.001
                                                                  (0.008)
security_deposit                                                           0.000
                                                                           (0.000)
log(security_deposit + 0.001)                                                       0.008
                                                                                    (0.001)
host_acceptance_rate                                                                         0.007
                                                                                             (0.014)
#observations                                   18571    18568    18571    18571    18571    18571
R squared                                       0.523    0.522    0.522    0.524    0.525    0.522
Adj. R Squared                                  0.523    0.521    0.521    0.523    0.525    0.521
Residual SE                                     0.531    0.531    0.532    0.530    0.530    0.532

Based on the models above, we keep the variables that improve the model, such as log(security_deposit + 0.001), and exclude the insignificant ones, such as instant_bookable and host_acceptance_rate.

5.2.4 Including Amenities as a Regressor

# amenities - try three models for two amenities - Wifi and Breakfast

#just wifi
model13 <- lm(price_4_nights ~ 
                 prop_type_simplified +
                 number_of_reviews +
                 review_scores_rating +
                 room_type +
                 bedrooms +
                 bathrooms +
                 beds +
                 accommodates +
                 host_is_superhost +
                 is_location_exact +
                 neighbourhood_simplified +
                 wifi,
             regression_data
             )

model13 %>%
  tidy(conf.int=TRUE)
## # A tibble: 20 x 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)          6.84     0.0436      157.   0.         6.76      6.93   
##  2 prop_type_simplifi… -0.0359   0.0115       -3.11 1.89e-  3 -0.0585   -0.0132 
##  3 prop_type_simplifi…  0.0973   0.0132        7.38 1.64e- 13  0.0715    0.123  
##  4 prop_type_simplifi… -0.0234   0.0145       -1.61 1.07e-  1 -0.0519    0.00510
##  5 prop_type_simplifi…  0.261    0.0114       23.0  4.42e-115  0.239     0.283  
##  6 number_of_reviews   -0.00176  0.000197     -8.96 3.42e- 19 -0.00215  -0.00138
##  7 review_scores_rati…  0.00309  0.000365      8.47 2.65e- 17  0.00238   0.00381
##  8 room_typePrivate r… -0.416    0.00934     -44.5  0.        -0.434    -0.398  
##  9 room_typeShared ro… -0.927    0.0195      -47.6  0.        -0.966    -0.889  
## 10 bedrooms             0.0925   0.00676      13.7  2.15e- 42  0.0792    0.106  
## 11 bathrooms            0.0347   0.00398       8.70 3.55e- 18  0.0269    0.0425 
## 12 beds                -0.0330   0.00313     -10.5  6.25e- 26 -0.0391   -0.0269 
## 13 accommodates         0.110    0.00286      38.7  7.68e-315  0.105     0.116  
## 14 host_is_superhostT…  0.0625   0.00853       7.32 2.49e- 13  0.0458    0.0792 
## 15 is_location_exactT… -0.0766   0.00815      -9.40 6.06e- 21 -0.0926   -0.0607 
## 16 neighbourhood_simp… -0.196    0.0138      -14.2  1.01e- 45 -0.223    -0.169  
## 17 neighbourhood_simp… -0.182    0.0124      -14.8  5.51e- 49 -0.207    -0.158  
## 18 neighbourhood_simp… -0.205    0.0353       -5.81 6.31e-  9 -0.275    -0.136  
## 19 neighbourhood_simp… -0.293    0.0123      -23.8  1.58e-123 -0.317    -0.269  
## 20 wifiTRUE             0.118    0.0262        4.50 6.99e-  6  0.0663    0.169
model13 %>%
  glance() %>% 
  kbl() %>% 
  kable_styling()
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC    BIC    deviance  df.residual  nobs
0.522      0.522          0.531  1067       0        19  -14598  29239  29403  5238      18551        18571
#just breakfast
model14 <- lm(price_4_nights ~ 
                 prop_type_simplified +
                 number_of_reviews +
                 review_scores_rating +
                 room_type +
                 bedrooms +
                 bathrooms +
                 beds +
                 accommodates +
                 host_is_superhost +
                 is_location_exact +
                 neighbourhood_simplified +
                 breakfast,
              regression_data
            )

model14 %>%
  tidy(conf.int=TRUE)
## # A tibble: 20 x 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)          6.97     0.0361     193.    0.         6.90      7.04   
##  2 prop_type_simplifi… -0.0292   0.0114      -2.55  1.07e-  2 -0.0516   -0.00677
##  3 prop_type_simplifi…  0.0931   0.0131       7.14  9.93e- 13  0.0676    0.119  
##  4 prop_type_simplifi… -0.0129   0.0144      -0.892 3.72e-  1 -0.0411    0.0154 
##  5 prop_type_simplifi…  0.224    0.0114      19.6   1.41e- 84  0.201     0.246  
##  6 number_of_reviews   -0.00178  0.000195    -9.17  5.06e- 20 -0.00217  -0.00140
##  7 review_scores_rati…  0.00303  0.000362     8.38  5.85e- 17  0.00232   0.00374
##  8 room_typePrivate r… -0.443    0.00935    -47.3   0.        -0.461    -0.424  
##  9 room_typeShared ro… -0.952    0.0193     -49.3   0.        -0.990    -0.914  
## 10 bedrooms             0.0907   0.00669     13.5   1.37e- 41  0.0775    0.104  
## 11 bathrooms            0.0312   0.00395      7.91  2.69e- 15  0.0235    0.0390 
## 12 beds                -0.0330   0.00310    -10.7   1.95e- 26 -0.0391   -0.0269 
## 13 accommodates         0.110    0.00283     38.8   1.26e-316  0.104     0.115  
## 14 host_is_superhostT…  0.0636   0.00844      7.53  5.20e- 14  0.0470    0.0801 
## 15 is_location_exactT… -0.0686   0.00808     -8.49  2.13e- 17 -0.0845   -0.0528 
## 16 neighbourhood_simp… -0.196    0.0137     -14.4   1.51e- 46 -0.223    -0.169  
## 17 neighbourhood_simp… -0.182    0.0122     -14.9   6.20e- 50 -0.206    -0.158  
## 18 neighbourhood_simp… -0.205    0.0350      -5.87  4.54e-  9 -0.274    -0.137  
## 19 neighbourhood_simp… -0.316    0.0122     -25.9   3.22e-145 -0.340    -0.292  
## 20 breakfastTRUE        0.266    0.0134      19.9   6.03e- 87  0.239     0.292
model14 %>%
  glance() %>% 
  kbl() %>% 
  kable_styling()
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC    BIC    deviance  df.residual  nobs
0.532      0.531          0.526  1108       0        19  -14413  28868  29032  5134      18551        18571
# both wifi and breakfast
model15 <- lm(price_4_nights ~ 
                 prop_type_simplified +
                 number_of_reviews +
                 review_scores_rating +
                 room_type +
                 bedrooms +
                 bathrooms +
                 beds +
                 accommodates +
                 host_is_superhost +
                 is_location_exact +
                 neighbourhood_simplified +
                 wifi +
                 breakfast,
              regression_data
              )

model15 %>%
  tidy(conf.int=TRUE)
## # A tibble: 21 x 7
##    term                 estimate std.error statistic  p.value conf.low conf.high
##    <chr>                   <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
##  1 (Intercept)           6.87     0.0432     159.    0.        6.79      6.96   
##  2 prop_type_simplifie… -0.0300   0.0114      -2.62  8.70e- 3 -0.0524   -0.00758
##  3 prop_type_simplifie…  0.0931   0.0130       7.13  1.00e-12  0.0675    0.119  
##  4 prop_type_simplifie… -0.0140   0.0144      -0.974 3.30e- 1 -0.0423    0.0142 
##  5 prop_type_simplifie…  0.224    0.0114      19.6   8.65e-85  0.201     0.246  
##  6 number_of_reviews    -0.00181  0.000195    -9.32  1.29e-20 -0.00220  -0.00143
##  7 review_scores_rating  0.00297  0.000362     8.22  2.10e-16  0.00227   0.00368
##  8 room_typePrivate ro… -0.443    0.00935    -47.4   0.       -0.461    -0.425  
##  9 room_typeShared room -0.951    0.0193     -49.3   0.       -0.989    -0.914  
## 10 bedrooms              0.0909   0.00669     13.6   7.35e-42  0.0778    0.104  
## # … with 11 more rows
model15 %>%
  glance() %>% 
  kbl() %>% 
  kable_styling()
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC    BIC    deviance  df.residual  nobs
0.532      0.531          0.526  1054       0        20  -14405  28854  29026  5130      18550        18571
# count of amenities
model16 <- lm(price_4_nights ~ 
                 prop_type_simplified +
                 number_of_reviews +
                 review_scores_rating +
                 room_type +
                 bedrooms +
                 bathrooms +
                 beds +
                 accommodates +
                 host_is_superhost +
                 is_location_exact +
                 neighbourhood_simplified +
                 services,
             regression_data
             )

model16 %>%
  tidy(conf.int=TRUE)
## # A tibble: 20 x 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)          6.84     0.0364      188.   0.         6.77      6.91   
##  2 prop_type_simplifi… -0.0426   0.0114       -3.73 1.90e-  4 -0.0650   -0.0202 
##  3 prop_type_simplifi…  0.100    0.0130        7.67 1.80e- 14  0.0744    0.126  
##  4 prop_type_simplifi… -0.0249   0.0144       -1.73 8.29e-  2 -0.0531    0.00325
##  5 prop_type_simplifi…  0.247    0.0113       21.9  2.54e-105  0.225     0.269  
##  6 number_of_reviews   -0.00264  0.000199    -13.3  5.98e- 40 -0.00303  -0.00225
##  7 review_scores_rati…  0.00269  0.000362      7.45 9.90e- 14  0.00198   0.00340
##  8 room_typePrivate r… -0.409    0.00924     -44.2  0.        -0.427    -0.391  
##  9 room_typeShared ro… -0.893    0.0193      -46.2  0.        -0.931    -0.855  
## 10 bedrooms             0.0929   0.00668      13.9  1.03e- 43  0.0798    0.106  
## 11 bathrooms            0.0303   0.00395       7.67 1.78e- 14  0.0225    0.0380 
## 12 beds                -0.0319   0.00309     -10.3  6.33e- 25 -0.0380   -0.0259 
## 13 accommodates         0.106    0.00283      37.5  3.15e-296  0.101     0.112  
## 14 host_is_superhostT…  0.0298   0.00858       3.47 5.25e-  4  0.0129    0.0466 
## 15 is_location_exactT… -0.0686   0.00807      -8.50 1.98e- 17 -0.0844   -0.0528 
## 16 neighbourhood_simp… -0.209    0.0137      -15.3  1.04e- 52 -0.236    -0.182  
## 17 neighbourhood_simp… -0.199    0.0123      -16.3  3.69e- 59 -0.223    -0.175  
## 18 neighbourhood_simp… -0.223    0.0350       -6.38 1.83e- 10 -0.291    -0.154  
## 19 neighbourhood_simp… -0.305    0.0122      -25.0  3.84e-136 -0.328    -0.281  
## 20 services             0.00871  0.000413     21.1  1.98e- 97  0.00790   0.00952
model16 %>%
  glance() %>% 
  kbl() %>% 
  kable_styling()
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC    BIC    deviance  df.residual  nobs
0.533      0.532          0.525  1113       0        19  -14389  28820  28984  5121      18551        18571
# log of number of amenities
model17 <- lm(price_4_nights ~ 
                prop_type_simplified +
                number_of_reviews +
                review_scores_rating + 
                room_type +
                bedrooms +
                bathrooms +
                beds +
                accommodates +
                host_is_superhost +
                is_location_exact +
                neighbourhood_simplified +
                log(services + 0.000001),
              regression_data
              )

model17 %>%
  tidy(conf.int=TRUE)
## # A tibble: 20 x 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)          6.43     0.0442      146.   0.         6.34     6.51    
##  2 prop_type_simplifi… -0.0453   0.0114       -3.97 7.27e-  5 -0.0677  -0.0229  
##  3 prop_type_simplifi…  0.100    0.0130        7.69 1.49e- 14  0.0748   0.126   
##  4 prop_type_simplifi… -0.0286   0.0144       -1.99 4.68e-  2 -0.0568  -0.000400
##  5 prop_type_simplifi…  0.248    0.0113       22.0  2.31e-106  0.226    0.271   
##  6 number_of_reviews   -0.00260  0.000199    -13.1  7.39e- 39 -0.00299 -0.00221 
##  7 review_scores_rati…  0.00265  0.000362      7.31 2.78e- 13  0.00194  0.00336 
##  8 room_typePrivate r… -0.407    0.00925     -44.0  0.        -0.426   -0.389   
##  9 room_typeShared ro… -0.886    0.0194      -45.8  0.        -0.924   -0.848   
## 10 bedrooms             0.0922   0.00669      13.8  5.17e- 43  0.0791   0.105   
## 11 bathrooms            0.0307   0.00395       7.79 7.02e- 15  0.0230   0.0385  
## 12 beds                -0.0315   0.00310     -10.2  2.90e- 24 -0.0376  -0.0254  
## 13 accommodates         0.107    0.00283      37.7  1.18e-299  0.101    0.112   
## 14 host_is_superhostT…  0.0330   0.00856       3.85 1.17e-  4  0.0162   0.0498  
## 15 is_location_exactT… -0.0677   0.00808      -8.38 5.90e- 17 -0.0835  -0.0518  
## 16 neighbourhood_simp… -0.210    0.0137      -15.4  4.00e- 53 -0.237   -0.183   
## 17 neighbourhood_simp… -0.202    0.0123      -16.4  3.09e- 60 -0.226   -0.178   
## 18 neighbourhood_simp… -0.224    0.0350       -6.40 1.54e- 10 -0.293   -0.155   
## 19 neighbourhood_simp… -0.307    0.0122      -25.2  3.30e-138 -0.331   -0.283   
## 20 log(services + 1e-…  0.202    0.00982      20.5  1.25e- 92  0.182    0.221
model17 %>%
  glance() %>% 
  kbl() %>% 
  kable_styling()
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC    BIC    deviance  df.residual  nobs
0.532      0.532          0.526  1111       0        19  -14400  28842  29006  5127      18551        18571
# summary table to compare last few models
huxreg(model13, model14, model15, model16, model17,
       statistics = c('#observations' = 'nobs', 
                      'R squared' = 'r.squared', 
                      'Adj. R Squared' = 'adj.r.squared', 
                      'Residual SE' = 'sigma'), 
       bold_signif = 0.05, 
       stars = NULL
) %>% 
  set_caption('Comparison of Models 3.0')
Comparison of Models 3.0
                                  (1)      (2)      (3)      (4)      (5)
(Intercept)                       6.843    6.966    6.871    6.840    6.427
                                  (0.044)  (0.036)  (0.043)  (0.036)  (0.044)
prop_type_simplifiedCondominium   -0.036   -0.029   -0.030   -0.043   -0.045
                                  (0.012)  (0.011)  (0.011)  (0.011)  (0.011)
prop_type_simplifiedHouse         0.097    0.093    0.093    0.100    0.100
                                  (0.013)  (0.013)  (0.013)  (0.013)  (0.013)
prop_type_simplifiedLoft          -0.023   -0.013   -0.014   -0.025   -0.029
                                  (0.015)  (0.014)  (0.014)  (0.014)  (0.014)
prop_type_simplifiedOther         0.261    0.224    0.224    0.247    0.248
                                  (0.011)  (0.011)  (0.011)  (0.011)  (0.011)
number_of_reviews                 -0.002   -0.002   -0.002   -0.003   -0.003
                                  (0.000)  (0.000)  (0.000)  (0.000)  (0.000)
review_scores_rating              0.003    0.003    0.003    0.003    0.003
                                  (0.000)  (0.000)  (0.000)  (0.000)  (0.000)
room_typePrivate room             -0.416   -0.443   -0.443   -0.409   -0.407
                                  (0.009)  (0.009)  (0.009)  (0.009)  (0.009)
room_typeShared room              -0.927   -0.952   -0.951   -0.893   -0.886
                                  (0.019)  (0.019)  (0.019)  (0.019)  (0.019)
bedrooms                          0.092    0.091    0.091    0.093    0.092
                                  (0.007)  (0.007)  (0.007)  (0.007)  (0.007)
bathrooms                         0.035    0.031    0.031    0.030    0.031
                                  (0.004)  (0.004)  (0.004)  (0.004)  (0.004)
beds                              -0.033   -0.033   -0.033   -0.032   -0.032
                                  (0.003)  (0.003)  (0.003)  (0.003)  (0.003)
accommodates                      0.110    0.110    0.109    0.106    0.107
                                  (0.003)  (0.003)  (0.003)  (0.003)  (0.003)
host_is_superhostTRUE             0.062    0.064    0.063    0.030    0.033
                                  (0.009)  (0.008)  (0.008)  (0.009)  (0.009)
is_location_exactTRUE             -0.077   -0.069   -0.068   -0.069   -0.068
                                  (0.008)  (0.008)  (0.008)  (0.008)  (0.008)
neighbourhood_simplifiedRing 3    -0.196   -0.196   -0.195   -0.209   -0.210
                                  (0.014)  (0.014)  (0.014)  (0.014)  (0.014)
neighbourhood_simplifiedRing 4    -0.182   -0.182   -0.182   -0.199   -0.202
                                  (0.012)  (0.012)  (0.012)  (0.012)  (0.012)
neighbourhood_simplifiedRing 5    -0.205   -0.205   -0.205   -0.223   -0.224
                                  (0.035)  (0.035)  (0.035)  (0.035)  (0.035)
neighbourhood_simplifiedRing 6    -0.293   -0.316   -0.315   -0.305   -0.307
                                  (0.012)  (0.012)  (0.012)  (0.012)  (0.012)
wifiTRUE                          0.118             0.104
                                  (0.026)           (0.026)
breakfastTRUE                              0.266    0.264
                                           (0.013)  (0.013)
services                                                     0.009
                                                             (0.000)
log(services + 1e-06)                                                 0.202
                                                                      (0.010)
#observations                     18571    18571    18571    18571    18571
R squared                         0.522    0.532    0.532    0.533    0.532
Adj. R Squared                    0.522    0.531    0.531    0.532    0.532
Residual SE                       0.531    0.526    0.526    0.525    0.526

Taken collectively as a single count, services, the number of amenities explains more of the variation in the regressand than wifi or breakfast alone (adjusted R-squared of 0.532, against 0.522 and 0.531 respectively). However, we still include wifi and breakfast in the final model, as these are two of the most important amenities people look for when booking an Airbnb.
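To make the distinction between the two kinds of amenity regressor concrete, here is a minimal sketch of how dummy variables and a count could be derived from a raw amenities string in base R. The column format and the toy listings data frame are assumptions for illustration; how wifi, breakfast and services were actually built is not shown in this section.

```r
# Hypothetical raw data: the real column format is an assumption.
listings <- data.frame(
  amenities = c('{Wifi,Kitchen,Breakfast}', '{Wifi,"Air conditioning"}', '{}'),
  stringsAsFactors = FALSE
)

# Strip braces and quotes, then split on commas to get one vector per listing.
cleaned <- gsub('[{}"]', "", listings$amenities)
items   <- strsplit(cleaned, ",", fixed = TRUE)

# Dummies flag one specific amenity; the count summarises all of them.
listings$wifi      <- vapply(items, function(x) "Wifi" %in% x, logical(1))
listings$breakfast <- vapply(items, function(x) "Breakfast" %in% x, logical(1))
listings$services  <- lengths(items)  # number of amenities listed
print(listings)
```

Because the count already includes the amenities behind the dummies, the three regressors overlap in information, which is one reason to check collinearity when they enter the same model.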

final_model <- lm(price_4_nights ~ 
                    prop_type_simplified +
                    number_of_reviews +
                    review_scores_rating + 
                    room_type +
                    bedrooms +
                    beds +
                    bathrooms +
                    accommodates + 
                    host_is_superhost +
                    is_location_exact + 
                    neighbourhood_simplified +  
                    cancellation_policy +
                    log(security_deposit + 0.001) +
                    wifi + 
                    breakfast + 
                    services,
                  regression_data
                  )

final_model %>%
  tidy(conf.int=TRUE)
## # A tibble: 25 x 7
##    term                 estimate std.error statistic  p.value conf.low conf.high
##    <chr>                   <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
##  1 (Intercept)           6.83     0.0431      159.   0.        6.75      6.92   
##  2 prop_type_simplifie… -0.0410   0.0113       -3.63 2.88e- 4 -0.0631   -0.0188 
##  3 prop_type_simplifie…  0.0980   0.0129        7.61 2.95e-14  0.0728    0.123  
##  4 prop_type_simplifie… -0.0207   0.0142       -1.46 1.45e- 1 -0.0486    0.00717
##  5 prop_type_simplifie…  0.220    0.0113       19.5  3.50e-84  0.198     0.243  
##  6 number_of_reviews    -0.00277  0.000198    -14.0  1.70e-44 -0.00316  -0.00239
##  7 review_scores_rating  0.00247  0.000358      6.91 5.17e-12  0.00177   0.00317
##  8 room_typePrivate ro… -0.426    0.00927     -45.9  0.       -0.444    -0.408  
##  9 room_typeShared room -0.905    0.0192      -47.1  0.       -0.942    -0.867  
## 10 bedrooms              0.0914   0.00661      13.8  2.71e-43  0.0785    0.104  
## # … with 15 more rows
final_model %>%
  glance() %>% 
  kbl() %>% 
  kable_styling()
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC    BIC    deviance  df.residual  nobs
0.544      0.543          0.519  921        0        24  -14169  28391  28594  5001      18546        18571
vif(final_model)
##                               GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified          1.42  4            1.04
## number_of_reviews             1.20  1            1.10
## review_scores_rating          1.07  1            1.03
## room_type                     1.35  2            1.08
## bedrooms                      4.46  1            2.11
## beds                          3.12  1            1.77
## bathrooms                     1.64  1            1.28
## accommodates                  4.52  1            2.13
## host_is_superhost             1.19  1            1.09
## is_location_exact             1.09  1            1.04
## neighbourhood_simplified      1.41  4            1.04
## cancellation_policy           1.12  2            1.03
## log(security_deposit + 0.001) 1.09  1            1.04
## wifi                          1.02  1            1.01
## breakfast                     1.18  1            1.08
## services                      1.25  1            1.12
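For regressors with one degree of freedom, the GVIF reported above equals the ordinary variance inflation factor, 1 / (1 - R²_j), where R²_j comes from regressing regressor j on all the others. For factors with several levels, GVIF^(1/(2*Df)) is the column to compare across terms. The sketch below reproduces the one-degree-of-freedom calculation in base R on mtcars (stand-in data, hypothetical variable choice), so it does not depend on the package that provides vif(). A common rule of thumb, assumed here rather than taken from the report, treats values above 5 as worth a closer look; by that standard bedrooms (4.46) and accommodates (4.52) are the highest values in the table but stay under the cutoff.

```r
# Stand-in data: built-in mtcars, not the Airbnb listings.
predictors <- c("wt", "hp", "disp")

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the others.
vif_by_hand <- vapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux    <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
}, numeric(1))

print(round(vif_by_hand, 2))
```

A VIF of 1 means a regressor is uncorrelated with the others; larger values indicate that its coefficient's variance is inflated by overlap with them.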

5.2.5 Diagnostics, Collinearity, Summary Tables

mosaic::msummary(final_model)
##                                                 Estimate Std. Error t value
## (Intercept)                                     6.834929   0.043110  158.55
## prop_type_simplifiedCondominium                -0.040955   0.011292   -3.63
## prop_type_simplifiedHouse                       0.098044   0.012890    7.61
## prop_type_simplifiedLoft                       -0.020720   0.014230   -1.46
## prop_type_simplifiedOther                       0.220428   0.011281   19.54
## number_of_reviews                              -0.002774   0.000198  -14.03
## review_scores_rating                            0.002472   0.000358    6.91
## room_typePrivate room                          -0.426002   0.009272  -45.94
## room_typeShared room                           -0.904553   0.019201  -47.11
## bedrooms                                        0.091438   0.006611   13.83
## beds                                           -0.031420   0.003058  -10.27
## bathrooms                                       0.027585   0.003903    7.07
## accommodates                                    0.104229   0.002808   37.12
## host_is_superhostTRUE                           0.025002   0.008554    2.92
## is_location_exactTRUE                          -0.056128   0.008001   -7.01
## neighbourhood_simplifiedRing 3                 -0.201454   0.013509  -14.91
## neighbourhood_simplifiedRing 4                 -0.196317   0.012120  -16.20
## neighbourhood_simplifiedRing 5                 -0.222180   0.034552   -6.43
## neighbourhood_simplifiedRing 6                 -0.309583   0.012135  -25.51
## cancellation_policymoderate                     0.031338   0.009109    3.44
## cancellation_policystrict_14_with_grace_period  0.056111   0.010155    5.53
## log(security_deposit + 0.001)                   0.006432   0.000665    9.67
## wifiTRUE                                        0.058051   0.025706    2.26
## breakfastTRUE                                   0.232856   0.013348   17.45
## services                                        0.007012   0.000419   16.74
##                                                Pr(>|t|)    
## (Intercept)                                     < 2e-16 ***
## prop_type_simplifiedCondominium                 0.00029 ***
## prop_type_simplifiedHouse                       3.0e-14 ***
## prop_type_simplifiedLoft                        0.14540    
## prop_type_simplifiedOther                       < 2e-16 ***
## number_of_reviews                               < 2e-16 ***
## review_scores_rating                            5.2e-12 ***
## room_typePrivate room                           < 2e-16 ***
## room_typeShared room                            < 2e-16 ***
## bedrooms                                        < 2e-16 ***
## beds                                            < 2e-16 ***
## bathrooms                                       1.6e-12 ***
## accommodates                                    < 2e-16 ***
## host_is_superhostTRUE                           0.00347 ** 
## is_location_exactTRUE                           2.4e-12 ***
## neighbourhood_simplifiedRing 3                  < 2e-16 ***
## neighbourhood_simplifiedRing 4                  < 2e-16 ***
## neighbourhood_simplifiedRing 5                  1.3e-10 ***
## neighbourhood_simplifiedRing 6                  < 2e-16 ***
## cancellation_policymoderate                     0.00058 ***
## cancellation_policystrict_14_with_grace_period  3.3e-08 ***
## log(security_deposit + 0.001)                   < 2e-16 ***
## wifiTRUE                                        0.02394 *  
## breakfastTRUE                                   < 2e-16 ***
## services                                        < 2e-16 ***
## 
## Residual standard error: 0.519 on 18546 degrees of freedom
##   (14926 observations deleted due to missingness)
## Multiple R-squared:  0.544,  Adjusted R-squared:  0.543 
## F-statistic:  921 on 24 and 18546 DF,  p-value: <2e-16
autoplot(final_model)

augment(final_model) %>% 
  arrange(desc(.std.resid))
## # A tibble: 18,571 x 23
##    .rownames price_4_nights prop_type_simpl… number_of_revie… review_scores_r…
##    <chr>              <dbl> <chr>                       <dbl>            <dbl>
##  1 12673               12.3 House                           1              100
##  2 3135                12.4 Condominium                     1               60
##  3 1360                12.0 Apartment                       2              100
##  4 8288                12.5 Apartment                       1               60
##  5 13739               11.6 Apartment                       1               80
##  6 1355                12.5 Other                          16               86
##  7 9017                12.5 Condominium                     1               80
##  8 12477               11.7 Apartment                      35               97
##  9 17536               12.1 Condominium                     6               83
## 10 12907               11.3 Apartment                      25              100
## # … with 18,561 more rows, and 18 more variables: room_type <chr>,
## #   bedrooms <dbl>, beds <dbl>, bathrooms <dbl>, accommodates <dbl>,
## #   host_is_superhost <lgl>, is_location_exact <lgl>,
## #   neighbourhood_simplified <chr>, cancellation_policy <chr>,
## #   `log(security_deposit + 0.001)` <dbl>, wifi <lgl>, breakfast <lgl>,
## #   services <int>, .fitted <dbl>, .std.resid <dbl>, .hat <dbl>, .sigma <dbl>,
## #   .cooksd <dbl>

Residuals vs Fitted: The residuals show no obvious pattern and are centred around zero, so the linearity assumption appears to hold.

Normal Q-Q: There are substantial deviations from the straight line, indicating that the residuals may not be normally distributed, so the normality assumption may not hold.

Scale-Location: There is no apparent upward or downward trend across the fitted values, indicating roughly constant variance, so the equal-variance assumption appears to hold.

Residuals vs Leverage: Several points have high leverage and others have large absolute residuals. Such influential points may have an undue effect on the estimated model parameters.
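
The reading of the Residuals vs Leverage plot can be backed up numerically. The following is a minimal sketch using base R only and the built-in mtcars data as a stand-in for the Airbnb listings. The model formula and the cut-offs (Cook's distance above 4/n, leverage above 2p/n, absolute standardised residual above 3) are assumed rules of thumb for illustration, not thresholds used in our analysis.

```r
# Illustrative log-linear model on a built-in dataset (not the Airbnb data)
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Collect the per-observation diagnostics that the leverage plot is built on
diag_tbl <- data.frame(
  row       = rownames(mtcars),
  std_resid = rstandard(fit),       # standardised residuals
  leverage  = hatvalues(fit),       # diagonal of the hat matrix
  cooks_d   = cooks.distance(fit)   # Cook's distance
)

n <- nrow(mtcars)          # number of observations
p <- length(coef(fit))     # number of estimated coefficients

# Assumed rules of thumb for flagging potentially influential observations
flagged <- subset(
  diag_tbl,
  cooks_d > 4 / n | leverage > 2 * p / n | abs(std_resid) > 3
)

# Most influential observations first
flagged[order(-flagged$cooks_d), ]
```

Refitting the model without the flagged rows and comparing coefficients is one way to judge whether they materially change the estimates.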

5.2.6 Predicting price_4_nights For An Imaginary Airbnb

# here is an imaginary Airbnb
imaginary_airbnb <- tibble(prop_type_simplified = "Apartment",
                           room_type = "Private room",
                           number_of_reviews = 10,
                           review_scores_rating = 90,
                           beds = 1,
                           bathrooms = 1,
                           bedrooms = 1,
                           accommodates = 2,
                           neighbourhood_simplified = "Ring 5",
                           cancellation_policy = "flexible",
                           host_is_superhost = FALSE,
                           is_location_exact = TRUE,
                           security_deposit = 0,
                           services = 15,
                           wifi = TRUE,
                           breakfast = FALSE
                           )

imaginary_airbnb
## # A tibble: 1 x 16
##   prop_type_simpl… room_type number_of_revie… review_scores_r…  beds bathrooms
##   <chr>            <chr>                <dbl>            <dbl> <dbl>     <dbl>
## 1 Apartment        Private …               10               90     1         1
## # … with 10 more variables: bedrooms <dbl>, accommodates <dbl>,
## #   neighbourhood_simplified <chr>, cancellation_policy <chr>,
## #   host_is_superhost <lgl>, is_location_exact <lgl>, security_deposit <dbl>,
## #   services <dbl>, wifi <lgl>, breakfast <lgl>
# use broom::augment() to predict the price for this imaginary airbnb

lets_predict <- broom::augment(final_model,
                               newdata = imaginary_airbnb,
                               se_fit = TRUE)

# approximate 95% confidence interval for the fitted (mean) value: .fitted ± 1.96 * .se.fit
lets_predict <- lets_predict %>% 
  mutate(
    lower_ci = .fitted - 1.96 * .se.fit,
    upper_ci = .fitted + 1.96 * .se.fit
  )

lets_predict
## # A tibble: 1 x 20
##   prop_type_simpl… room_type number_of_revie… review_scores_r…  beds bathrooms
##   <chr>            <chr>                <dbl>            <dbl> <dbl>     <dbl>
## 1 Apartment        Private …               10               90     1         1
## # … with 14 more variables: bedrooms <dbl>, accommodates <dbl>,
## #   neighbourhood_simplified <chr>, cancellation_policy <chr>,
## #   host_is_superhost <lgl>, is_location_exact <lgl>, security_deposit <dbl>,
## #   services <dbl>, wifi <lgl>, breakfast <lgl>, .fitted <dbl>, .se.fit <dbl>,
## #   lower_ci <dbl>, upper_ci <dbl>
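# viewing our results: back-transform from the log scale to price units
view_final <- lets_predict %>% 
            select(c(lower_ci, 
                     .fitted, 
                     upper_ci, 
                     .se.fit)
                   ) %>% 
            mutate(
              lower_ci = exp(lower_ci),
              upper_ci = exp(upper_ci),
              .fitted = exp(.fitted),
              # note: exp(.se.fit) is a multiplicative factor, not a standard error in price units
              .se.fit = exp(.se.fit)
            )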

view_final
## # A tibble: 1 x 4
##   lower_ci .fitted upper_ci .se.fit
##      <dbl>   <dbl>    <dbl>   <dbl>
## 1     791.    846.     905.    1.03
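
Because the interval above is built from .se.fit, it describes uncertainty about the expected log price for listings with these characteristics, not the range in which a single listing's price is likely to fall. A prediction interval also includes the residual variance and is therefore wider. The sketch below contrasts the two with base R's predict() on the built-in mtcars data; the model and the new observation are illustrative assumptions, not our Airbnb model.

```r
# Illustrative log-linear model on a built-in dataset (not the Airbnb data)
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

# A hypothetical new observation
new_car <- data.frame(wt = 3, hp = 120)

# Interval for the mean response (what .fitted +/- 1.96 * .se.fit approximates)
conf_int <- predict(fit, newdata = new_car, interval = "confidence", level = 0.95)

# Interval for a single new observation (adds the residual variance)
pred_int <- predict(fit, newdata = new_car, interval = "prediction", level = 0.95)

# Back-transform both from the log scale to the original units
exp(conf_int)
exp(pred_int)
```

For a model with a residual standard error as large as ours (0.519 on the log scale), the prediction interval for an individual listing would be considerably wider than the confidence interval reported here.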
## # A tibble: 5 x 16
##   prop_type_simpl… room_type number_of_revie… review_scores_r…  beds bathrooms
##   <chr>            <chr>                <dbl>            <dbl> <dbl>     <dbl>
## 1 Apartment        Private …               35               99     1         1
## 2 Apartment        Private …               35               99     1         1
## 3 Apartment        Private …               35               99     1         1
## 4 Apartment        Private …               35               99     1         1
## 5 Apartment        Private …               35               99     1         1
## # … with 10 more variables: bedrooms <dbl>, accommodates <dbl>,
## #   neighbourhood_simplified <chr>, cancellation_policy <chr>,
## #   host_is_superhost <lgl>, is_location_exact <lgl>, security_deposit <dbl>,
## #   services <dbl>, wifi <lgl>, breakfast <lgl>
## # A tibble: 5 x 20
##   prop_type_simpl… room_type number_of_revie… review_scores_r…  beds bathrooms
##   <chr>            <chr>                <dbl>            <dbl> <dbl>     <dbl>
## 1 Apartment        Private …               35               99     1         1
## 2 Apartment        Private …               35               99     1         1
## 3 Apartment        Private …               35               99     1         1
## 4 Apartment        Private …               35               99     1         1
## 5 Apartment        Private …               35               99     1         1
## # … with 14 more variables: bedrooms <dbl>, accommodates <dbl>,
## #   neighbourhood_simplified <chr>, cancellation_policy <chr>,
## #   host_is_superhost <lgl>, is_location_exact <lgl>, security_deposit <dbl>,
## #   services <dbl>, wifi <lgl>, breakfast <lgl>, .fitted <dbl>, .se.fit <dbl>,
## #   lower_ci <dbl>, upper_ci <dbl>
## # A tibble: 5 x 4
##   lower_ci .fitted upper_ci .se.fit
##      <dbl>   <dbl>    <dbl>   <dbl>
## 1    1433.   1491.    1550.    1.02
## 2    1136.   1173.    1210.    1.02
## 3     877.    939.    1006.    1.04
## 4     823.    882.     945.    1.04
## 5     753.    807.     864.    1.04

The tables above give the predicted price of a 4-night stay for imaginary Beijing Airbnb listings with different characteristics. The predicted price shifts noticeably when characteristics such as location and breakfast are changed.
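
The size of these shifts can also be read directly from the coefficients. Assuming the response is the natural log of price_4_nights (as the exp() back-transformation above suggests), a coefficient b corresponds to a change of 100 * (exp(b) - 1) percent in price relative to the reference level, holding the other regressors fixed. The sketch below applies this to a few estimates copied from the summary table in 5.2.5; the object names are illustrative.

```r
# Estimates copied from the msummary(final_model) output in section 5.2.5
coefs <- c(
  breakfastTRUE                    = 0.232856,
  host_is_superhostTRUE            = 0.025002,
  "neighbourhood_simplifiedRing 6" = -0.309583,
  "room_typeShared room"           = -0.904553
)

# Percentage change in price implied by each coefficient of a log-linear model
pct_change <- 100 * (exp(coefs) - 1)

round(pct_change, 1)
```

On this reading, offering breakfast is associated with a price roughly 26% higher, while a shared room is associated with a price roughly 60% lower than the reference room type.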

In the model above, each variable enters in linear or log form. In reality, variation in prices cannot be fully explained by a linear regression model, so we believe the explanatory power of our model could be improved further, but the methods required are outside the scope of this project.

6 Takeaway

This analytics project mainly exercises the use of:

  • Libraries: corrplot, dplyr, GGally, huxtable, patchwork, kableExtra, car, readr, rsample, ggridges, ggfortify, stringr.

  • Functions: lm, autoplot, augment, kbl, tidy, ggpairs.